
Data Mining
This book is a part of the course by Jaipur National University, Jaipur. It contains the course content for Data Mining.

JNU, Jaipur First Edition 2013

The content in this book is the copyright of JNU. All rights reserved. No part of the content may be reproduced, stored in a retrieval system, broadcast or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publisher.

JNU makes reasonable endeavours to ensure content is current and accurate. JNU reserves the right to alter the content whenever the need arises, and to vary it at any time without prior notice.

Index

I. Content
II. List of Figures
III. List of Tables
IV. Abbreviations
V. Case Study
VI. Bibliography
VII. Self Assessment Answers

Book at a Glance

Contents

Chapter I  Data Warehouse – Need, Planning and Architecture
Aim
Objectives
Learning outcome
1.1 Introduction
1.2 Need for Data Warehousing
1.3 Basic Elements of Data Warehousing
1.4 Project Planning and Management
1.5 Architecture and Infrastructure
1.5.1 Infrastructure
1.5.2 Metadata
1.5.3 Metadata Components
Summary
References
Recommended Reading
Self Assessment

Chapter II  Data Design and Data Representation
Aim
Objectives
Learning outcome
2.1 Introduction
2.2 Design Decision
2.3 Use of CASE Tools
2.4 …
2.4.1 Review of a Simple STAR Schema
2.4.2 Star Schema Keys
2.5 Dimensional Modelling
2.5.1 E-R Modelling versus Dimensional Modelling
2.6 Data Extraction
2.6.1 Source Identification
2.6.2 Data Extraction Techniques
2.6.3 Data in Operational Systems
2.7 …
2.7.1 Major Transformation Types
2.7.2 … and Consolidation
2.7.3 Implementing Transformation
2.8 Data Loading
2.9 …
2.10 Information Access and Delivery
2.11 Matching Information to Classes of Users OLAP in Data Warehouse
2.11.1 Information from the Data Warehouse
2.11.2 Information Potential
Summary
References
Recommended Reading
Self Assessment

Chapter III  …
Aim
Objectives
Learning outcome
3.1 Introduction
3.2 Crucial Concepts of Data Mining
3.2.1 Bagging (Voting, Averaging)
3.2.2 Boosting
3.2.3 … (in Data Mining)
3.2.4 … (for Data Mining)
3.2.5 Deployment
3.2.6 Drill-Down Analysis
3.2.7 Feature Selection
3.2.8 Machine Learning
3.2.9 Meta-Learning
3.2.10 Models for Data Mining
3.2.11 Predictive Data Mining
3.2.12 …
3.3 Cross-Industry Standard Process: CRISP-DM
3.3.1 CRISP-DM: The Six Phases
3.4 Data Mining Techniques
3.5 Graph Mining
3.6 Social Network Analysis
3.6.1 Characteristics of Social Networks
3.6.2 Mining on Social Networks
3.7 Multirelational Data Mining
3.8 Data Mining Algorithms and their Types
3.8.1 Classification
3.8.2 Clustering
3.8.3 Association Rules
Summary
References
Recommended Reading
Self Assessment

Chapter IV  Web Application of Data Mining
Aim
Objectives
Learning outcome
4.1 Introduction
4.2 Goals of Data Mining and Knowledge Discovery
4.3 Types of Knowledge Discovered during Data Mining
4.4 Knowledge Discovery Process
4.4.1 Overview of Knowledge Discovery Process
4.5 Web Mining
4.5.1 Web Analysis
4.5.2 Benefits of Web Mining
4.6 Web Content Mining
4.7 Web Structure Mining
4.8 Web Usage Mining

Summary
References
Recommended Reading
Self Assessment

Chapter V  Advance topics of Data Mining
Aim
Objectives
Learning outcome
5.1 Introduction
5.2 Concepts
5.2.1 Mechanism
5.2.2 Knowledge to be Discovered
5.3 Techniques of SDMKD
5.3.1 SDMKD-based Image Classification
5.3.2 Cloud Model
5.3.3 Data Fields
5.4 Design- and Model-based Approaches to Spatial Sampling
5.4.1 Design-based Approach to Sampling
5.4.2 Model-based Approach to Sampling
5.5 Temporal Mining
5.5.1 Time in Data Warehouses
5.5.2 Temporal Constraints and Temporal Relations
5.5.3 Requirements for a Temporal Knowledge-Based Management System
5.6 Mediators
5.6.1 Temporal Relation Discovery
5.6.2 Semantic Queries on Temporal Data
5.7 Temporal Data Types
5.8 Temporal Data Processing
5.8.1 Data Normalisation
5.9 Temporal Event Representation
5.9.1 Event Representation Using Markov Models
5.9.2 A Formalism for Temporal Objects and Repetitions
5.10 Classification Techniques
5.10.1 Distance-Based Classifier
5.10.2 Bayes Classifier
5.10.3 Decision Tree
5.10.4 Neural Networks in Classification
5.11 Sequence Mining
5.11.1 Apriori Algorithm and Its Extension to Sequence Mining
5.11.2 The GSP Algorithm
Summary
References
Recommended Reading
Self Assessment

Chapter VI  Application and Trends of Data Mining
Aim
Objectives
Learning outcome
6.1 Introduction
6.2 Applications of Data Mining
6.2.1 Aggregation and Approximation in Spatial and Multimedia Data Generalisation

6.2.2 Generalisation of Object Identifiers and Class/Subclass Hierarchies
6.2.3 Generalisation of Class Composition Hierarchies
6.2.4 Construction and Mining of Object Cubes
6.2.5 Generalisation-Based Mining of Plan by Divide-and-Conquer
6.3 Spatial Data Mining
6.3.1 Spatial Data Cube Construction and Spatial OLAP
6.3.2 Mining Spatial Association and Co-location Patterns
6.3.3 Mining Raster Databases
6.4 Multimedia Data Mining
6.4.1 Multidimensional Analysis of Multimedia Data
6.4.2 Classification and Prediction Analysis of Multimedia Data
6.4.3 Mining Associations in Multimedia Data
6.4.4 Audio and Video Data Mining
6.5 Text Mining
6.6 Query Processing Techniques
6.6.1 Ways of Dimensionality Reduction for Text
6.6.2 Probabilistic Latent Semantic Indexing Schemas
6.6.3 Mining the World Wide Web
6.6.4 Challenges
6.7 Data Mining for Healthcare Industry
6.8 Data Mining for Finance
6.9 Data Mining for Retail Industry
6.10 Data Mining for Telecommunication
6.11 Data Mining for Higher Education
6.12 Trends in Data Mining
6.12.1 Application Exploration
6.12.2 Scalable Data Mining Methods
6.12.3 Combination of Data Mining with Database Systems, Data Warehouse Systems, and Web Database Systems
6.12.4 Standardisation of Data Mining Language
6.12.5 Visual Data Mining
6.12.6 New Methods for Mining Complex Types of Data
6.12.7 Web Mining
6.13 System Products and Research Prototypes
6.13.1 Choosing a Data Mining System
6.14 Additional Themes on Data Mining
6.14.1 Theoretical Foundations of Data Mining
6.14.2 Statistical Data Mining
6.14.3 Visual and Audio Data Mining
6.14.4 Data Mining and Collaborative Filtering
Summary
References
Recommended Reading
Self Assessment

Chapter VII  Implementation and Maintenance
Aim
Objectives
Learning outcome
7.1 Introduction
7.2 Physical Design Steps
7.2.1 Develop Standards
7.2.2 Create Aggregates Plan
7.2.3 Determine the Data Partitioning Scheme

7.2.4 Establish Clustering Options
7.2.5 Prepare an Indexing Strategy
7.2.6 Assign Storage Structures
7.2.7 Complete Physical Model
7.3 Physical Storage
7.3.1 Storage Area Data Structures
7.3.2 Optimising Storage
7.3.3 Using RAID Technology
7.4 Indexing the Data Warehouse
7.4.1 B-Tree Index
7.4.2 Bitmapped Index
7.4.3 Clustered Indexes
7.4.4 Indexing the Fact Table
7.4.5 Indexing the Dimension Tables
7.5 Performance Enhancement Techniques
7.5.1 Data Partitioning
7.5.2 Data Clustering
7.5.3 Parallel Processing
7.5.4 Summary Levels
7.5.5 Referential Integrity Checks
7.5.6 Initialisation Parameters
7.5.7 Data Arrays
7.6 Data Warehouse Deployment
7.6.1 Data Warehouse Deployment Lifecycle
7.7 Growth and Maintenance
7.7.1 Monitoring the Data Warehouse
7.7.2 Collection of Statistics
7.7.3 Using Statistics for Growth Planning
7.7.4 Using Statistics for Fine-Tuning
7.7.5 Publishing Trends for Users
7.8 Managing the Data Warehouse
7.8.1 Platform Upgrades
7.8.2 Managing Data Growth
7.8.3 Storage Management
7.8.4 ETL Management
7.8.5 Information Delivery Enhancements
7.8.6 Ongoing Fine-Tuning
7.9 Models of Data Mining
Summary
References
Recommended Reading
Self Assessment

List of Figures

Fig. 1.1 Data warehousing
Fig. 1.2 Steps in data warehouse iteration project planning stage
Fig. 1.3 Means of identifying required information
Fig. 1.4 Typical data warehousing environment
Fig. 1.5 Overview of data warehouse infrastructure
Fig. 1.6 Data warehouse metadata
Fig. 1.7 Importance of mapping between two environments
Fig. 1.8 Simplest component of metadata
Fig. 1.9 Storing mapping information in the data warehouse
Fig. 1.10 Keeping track of when extracts have been run
Fig. 1.11 Other useful metadata
Fig. 2.1 Data design
Fig. 2.2 E-R modelling for OLTP systems
Fig. 2.3 Dimensional modelling for data warehousing
Fig. 2.4 Simple STAR schema for orders analysis
Fig. 2.5 Understanding a query from the STAR schema
Fig. 2.6 STAR schema keys
Fig. 2.7 Source identification process
Fig. 2.8 Data in operational systems
Fig. 2.9 Immediate data extraction options
Fig. 2.10 Data extraction using replication technology
Fig. 2.11 Deferred data extraction
Fig. 2.12 Typical data source environment
Fig. 2.13 Enterprise plan-execute-assess closed loop
Fig. 3.1 Data mining is the core of knowledge discovery process
Fig. 3.2 Steps for data mining projects
Fig. 3.3 Six-sigma methodology
Fig. 3.4 SEMMA
Fig. 3.5 CRISP–DM is an iterative, adaptive process
Fig. 3.6 Data mining techniques
Fig. 3.7 Methods of mining frequent subgraphs
Fig. 3.8 Heavy-tailed out-degree and in-degree distributions
Fig. 3.9 A financial multirelational schema
Fig. 3.10 Basic sequential covering algorithm
Fig. 3.11 A general-to-specific search through rule space
Fig. 3.12 A multilayer feed-forward neural network
Fig. 3.13 A hierarchical structure for STING clustering
Fig. 3.14 EM algorithm
Fig. 3.15 Tabular representation of association
Fig. 3.16 Association Rules Networks, 3D
Fig. 4.1 Knowledge base
Fig. 4.2 Sequential structure of KDP model
Fig. 4.3 Relative effort spent on specific steps of the KD process
Fig. 4.4 Web mining architecture
Fig. 5.1 Flow diagram of remote sensing image classification with inductive learning
Fig. 5.2 Three numerical characteristics
Fig. 5.3 Using spatial information for estimation from a sample
Fig. 5.4 Different layers of user query processing
Fig. 5.5 A Markov diagram that describes the probability of program enrolment changes
Fig. 6.1 Spatial mining
Fig. 6.2 Text mining
Fig. 7.1 Physical design process
Fig. 7.2 Data structures in the warehouse


Fig. 7.3 RAID technology
Fig. 7.4 B-Tree index example
Fig. 7.5 Data warehousing monitoring
Fig. 7.6 Statistics for the users

List of Tables

Table 1.1 Example of source data
Table 1.2 Example of target data (Data Warehouse)
Table 1.3 Data warehousing elements
Table 2.1 Basic tasks in data transformation
Table 2.2 Data transformation types
Table 2.3 Characteristics or indicators of high-quality data
Table 2.4 General areas where data warehouse can assist in the planning and assessment phases
Table 3.1 The six phases of CRISP-DM
Table 3.2 Requirements of clustering in data mining
Table 5.1 Spatial data mining and knowledge discovery in various viewpoints
Table 5.2 Main spatial knowledge to be discovered
Table 5.3 Techniques to be used in SDMKD
Table 5.4 Terms related to temporal data


Abbreviations

ANN - Artificial Neural Network
ANOVA - Analysis of Variance
ARIMA - Autoregressive Integrated Moving Average
ASCII - American Standard Code for Information Interchange
ATM - Automated Teller Machine
BQA - Business Question Assessment
C&RT - Classification and Regression Trees
CA - California
CGI - Computer Generated Imagery
CHAID - CHi-squared Automatic Interaction Detector
CPU - Central Processing Unit
CRISP-DM - Cross Industry Standard Process for Data Mining
DB - Database
DBMS - Database Management System
DBSCAN - Density-Based Spatial Clustering of Applications with Noise
DDL - Data Definition Language
DM - Data Mining
DMKD - Data Mining and Knowledge Discovery
DNA - DeoxyriboNucleic Acid
DSS - Decision Support System
DW - Data Warehouse
EBCDIC - Extended Binary Coded Decimal Interchange Code
EIS - Executive Information System
EM - Expectation-Maximisation
En - Entropy
ETL - Extraction, Transformation and Loading
Ex - Expected value
GPS - Global Positioning System
GUI - Graphical User Interface
HIV - Human Immunodeficiency Virus
HTML - Hypertext Markup Language
IR - Information Retrieval
IRC - Internet Relay Chat
JAD - Joint Application Development
KDD - Knowledge Discovery and Data Mining
KDP - Knowledge Discovery Process
KPI - Key Performance Indicator
MB - Megabytes
MBR - Master Boot Record
MRDM - Multirelational Data Mining
NY - New York
ODBC - Open Database Connectivity
OLAP - Online Analytical Processing
OLTP - Online Transaction Processing
OPTICS - Ordering Points to Identify the Clustering Structure
PC - Personal Computer
RAID - Redundant Array of Inexpensive Disks
RBF - Radial-Basis Function
RDBMS - Relational Database Management System
SAS - Statistical Analysis Software
SDMKD - Spatial Data Mining and Knowledge Discovery
SEMMA - Sample, Explore, Modify, Model, Assess
SOLAM - Spatial Online Analytical Mining

STING - Statistical Information Grid
TM - Temporal Mediator
UAT - User Acceptance Testing
VLSI - Very Large Scale Integration
WWW - World Wide Web
XML - Extensible Markup Language

Chapter I
Data Warehouse – Need, Planning and Architecture

Aim

The aim of this chapter is to:

• introduce the concept of data warehousing

• analyse the need for data warehousing

• explore the basic elements of data warehousing

Objectives

The objectives of this chapter are to:

• discuss planning and requirements for successful data warehousing

• highlight the architecture and infrastructure of data warehousing

• describe the operational applications in data warehousing

Learning outcome

At the end of this chapter, you will be able to:

• enlist the components of metadata

• comprehend the concept of metadata

• understand what is


1.1 Introduction
Data warehousing is the process of combining data from various, usually diverse, sources into one comprehensive and easily operated database. Common ways of accessing a data warehouse include queries, analysis and reporting. Because data warehousing ultimately creates one database, the number of sources can be anything, provided that the system can handle the volume. The final result, however, is uniform data, which can be manipulated more easily.

Definition: Although there are several definitions of a data warehouse, a widely accepted one is Inmon's (1992): a subject-oriented, integrated, time-variant and non-volatile repository of information in support of management's decision-making process. According to Kimball, a data warehouse is "a copy of transaction data specifically structured for query and analysis". It is a copy of sets of transactional data, which can come from a range of transactional systems.
• Data warehousing is commonly used by companies to study trends over time. Its primary function is facilitating strategic planning based on long-term data overviews. From such overviews, forecasts, business models and similar analytical tools, reports and projections can be made.
• Normally, as the data stored in data warehouses is intended to support overview-like reporting, the data is read-only for end users; it is refreshed through periodic loads rather than by individual updates.
• Besides being a storehouse for a large amount of data, a data warehouse must have systems in place that make it easy to access the data and use it in day-to-day operations.
• A data warehouse is often said to play a major role in a decision support system (DSS). DSS is a technique used by organisations to come up with facts, trends or relationships that can assist them in making effective decisions or creating effective strategies to accomplish their organisational goals.
• Data warehouses involve a long-term effort and are usually built in an incremental fashion. In addition to adding new subject areas, at each cycle the breadth of data content of existing subject areas is usually increased as users expand their analysis and their underlying data requirements.
• Users and applications can directly use the data warehouse to perform their analysis. Alternatively, a subset of the data warehouse data, often relating to a specific line of business and/or a specific functional area, can be exported to another, smaller data warehouse, commonly referred to as a data mart.
• Besides integrating and cleansing an organisation's data for better analysis, one of the benefits of building a data warehouse is that the effort initially spent to populate it with complete and accurate data content further benefits any data marts that are sourced from it.

Fig. 1.1 Data warehousing (Source: http://dheise.andrews.edu/dw/Avondale/images/fig06.gif)

The benefits of implementing a data warehouse are as follows:
• To provide a single version of the truth. This may appear rather obvious, but it is not uncommon in an enterprise for two database systems to hold two different versions of the truth; it is rare to find a university in which everyone agrees with the financial figures of income and expenditure at each reporting time during the year.
• To speed up ad hoc reports and queries that involve aggregations across many attributes, which are resource intensive. Managers require trends, sums and aggregations that allow, for example, comparing this year's performance to last year's or preparing forecasts for next year.
• To provide a system in which managers who do not have a strong technical background are able to run complex queries. If the managers are able to access the information they require, it is likely to reduce the bureaucracy around them.
• To provide a database that stores relatively clean data. By using a good ETL process, the data warehouse should have data of high quality. When errors are discovered, it may be desirable to correct them directly in the data warehouse and then propagate the corrections to the OLTP systems.
• To provide a database that stores historical data that may have been deleted from the OLTP systems. To improve response time, historical data is usually not retained in OLTP systems beyond what is required to respond to customer queries. The data warehouse can then store the data that is purged from the OLTP systems.

Example: In order to store data, many application designers in every branch have made their individual decisions as to how an application and database should be built. Thus, source systems will be different in naming conventions, variable measurements, encoding structures and physical attributes of data.


Consider an institution that has got several branches in various countries having hundreds of students. The following example explains how the data is integrated from source systems to target systems.

System Name | Attribute Name | Column Name | Datatype | Values
Source system 1 | Student Application Date | STUDENT_APPLICATION_DATE | NUMERIC(8,0) | 11012005
Source system 2 | Student Application Date | STUDN_APPLICATION_DATE | DATE | 11012005
Source system 3 | Application Date | APPLICATION_DATE | DATE | 01NOV2005

Table 1.1 Example of source data

In the above example, the attribute name, column name, datatype and values are totally different from one source system to another; this inconsistency in data can be avoided by integrating the data into a data warehouse with good standards.

System Name | Attribute Name | Column Name | Datatype | Values
Record # 1 | Student Application Date | STUDENT_APPLICATION_DATE | DATE | 01112005
Record # 2 | Student Application Date | STUDENT_APPLICATION_DATE | DATE | 01112005
Record # 3 | Student Application Date | STUDENT_APPLICATION_DATE | DATE | 01112005

Table 1.2 Example of target data (Data Warehouse)

In the above example of target data, attribute names, column names and datatypes are consistent throughout the target system. This is how data from various source systems is integrated and accurately stored into the data warehouse.
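To make the integration step concrete, the following Python sketch standardises the three source records from Table 1.1 into the uniform target layout of Table 1.2. It is only an illustration: the assumed date layouts (month-day-year digits in source systems 1 and 2, day-month name-year in source system 3) and the dictionary field names are hypothetical choices, not rules stated in the text.

```python
from datetime import datetime

# Hypothetical extracts matching Table 1.1; the date layouts are assumed
# for illustration (the text does not state them explicitly).
source_records = [
    {"system": "Source system 1", "column": "STUDENT_APPLICATION_DATE",
     "datatype": "NUMERIC(8,0)", "value": "11012005"},
    {"system": "Source system 2", "column": "STUDN_APPLICATION_DATE",
     "datatype": "DATE", "value": "11012005"},
    {"system": "Source system 3", "column": "APPLICATION_DATE",
     "datatype": "DATE", "value": "01NOV2005"},
]

# Each source system gets its own parsing rule (assumed formats).
PARSE_FORMAT = {
    "Source system 1": "%m%d%Y",   # 11012005 read as month, day, year
    "Source system 2": "%m%d%Y",
    "Source system 3": "%d%b%Y",   # 01NOV2005 read as day, month name, year
}

def to_target(record):
    """Convert one source record into the standard layout of Table 1.2."""
    parsed = datetime.strptime(record["value"], PARSE_FORMAT[record["system"]])
    return {
        "attribute_name": "Student Application Date",
        "column_name": "STUDENT_APPLICATION_DATE",
        "datatype": "DATE",
        "value": parsed.strftime("%d%m%Y"),   # consistent DDMMYYYY, e.g. 01112005
    }

for rec in source_records:
    print(to_target(rec))
```

The essential point is that every source system gets its own conversion rule, while the warehouse keeps a single attribute name, column name, datatype and value format.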

1.2 Need for Data Warehousing
Data warehousing is an important part, and in most cases the foundation, of business intelligence architecture. However, it is necessary to understand why a data warehouse is needed. The following points will help in understanding the need for data warehousing.

Data Integration
A data warehouse helps in combining scattered and unmanageable data into a particular format that can be easily accessed. If obtaining the required data is complicated, it may lead to inaccuracy in the business analysis. By arranging the data properly in a particular format, it is easy to analyse it across all products by location, time and channel.

The IT staff provides the reports required from time to time through a series of manual and automated steps of stripping or extracting the data from one source, sorting / merging with data from other sources, manually scrubbing and enriching the data and then running reports against it.

Data warehouse serves not only as a repository for historical data but also as an excellent data integration platform. The data in the data warehouse is integrated, subject oriented, time-variant and non-volatile to enable you to get a 360° view of your organisation.

Advanced reporting and analysis
The data warehouse is designed specifically to support querying, reporting and analysis tasks. The data model is flattened and structured by subject areas to make it easier for users to get even complex summarised information with a relatively simple query and to perform multi-dimensional analysis. This has two powerful benefits: multi-level trend analysis and end-user empowerment. Multi-level trend analysis provides the ability to analyse key trends at every level across several different dimensions, for instance Organisation, Product, Location, Channel and Time, and the hierarchies within them. Most reporting and visualisation tools take advantage of the underlying data model to provide powerful capabilities such as drill-down, roll-up, drill-across and various ways of slicing and dicing data.

The flattened data model makes it much easier for users to understand the data and write queries, rather than work with potentially several hundred tables and write long queries with complex table joins and clauses.
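As a small illustration of the kind of multi-level trend analysis described above, the sketch below joins a hypothetical fact table with two dimension tables and rolls the figures up by category, year and region. The table and column names are invented for the example, and it assumes the pandas library is available.

```python
import pandas as pd

# Hypothetical star-style tables: one fact table plus small dimension tables.
sales = pd.DataFrame({
    "product_id": [1, 1, 2, 2],
    "location_id": [10, 20, 10, 20],
    "year": [2012, 2013, 2012, 2013],
    "amount": [500.0, 650.0, 300.0, 420.0],
})
product = pd.DataFrame({"product_id": [1, 2],
                        "category": ["Books", "Software"]})
location = pd.DataFrame({"location_id": [10, 20],
                         "region": ["North", "South"]})

# The "flattened" view a user queries: the fact table joined with its dimensions.
flat = sales.merge(product, on="product_id").merge(location, on="location_id")

# Roll-up: total sales by category and year (this year versus last year).
print(flat.groupby(["category", "year"])["amount"].sum())

# Drill-down: break the same figures out further by region.
print(flat.groupby(["category", "region", "year"])["amount"].sum())
```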

Knowledge discovery and decision support
Knowledge discovery and data mining (KDD) is the automatic extraction of non-obvious, hidden knowledge from large volumes of data. For example, classification models could be used to classify members into low, medium and high lifetime value. Instead of coming up with a one-size-fits-all product, the membership can be divided into different clusters based on member profile using clustering models, and products could be customised for each cluster. Affinity groupings could be used to identify better product bundling strategies.

These KDD applications use various statistical and data mining techniques and rely on subject-oriented, summarised, cleansed and "de-noised" data, which a well designed data warehouse can readily provide. The data warehouse also enables an Executive Information System (EIS). Executives typically cannot be expected to go through different reports trying to get a holistic picture of the organisation's performance before making decisions; they need the KPIs delivered to them. Some of these KPIs may require cross-product or cross-departmental analysis, which may be too manually intensive, if not impossible, to perform on raw data from operational systems. This is especially relevant to relationship marketing and profitability analysis. The data in the data warehouse is already prepared and structured to support this kind of analysis.
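The clustering idea mentioned above can be sketched in a few lines. The example below uses scikit-learn's KMeans to split hypothetical member profiles into three clusters; the features and figures are invented, and a real project would standardise the features and validate the number of clusters before customising products per cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical member profiles: [age, annual spend, visits per year].
profiles = np.array([
    [22, 150.0, 4],
    [25, 180.0, 5],
    [41, 900.0, 22],
    [45, 950.0, 25],
    [63, 400.0, 10],
    [60, 420.0, 9],
])

# Partition the membership into three clusters; products could then be
# customised per cluster rather than offering one product for everyone.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(profiles)

for profile, label in zip(profiles, labels):
    print(profile, "-> cluster", label)
```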

Performance
Finally, the performance of transactional systems and query response time make the case for a data warehouse. The transactional systems are meant to do just that – perform transactions efficiently – and hence are designed to optimise frequent database reads and writes. The data warehouse, on the other hand, is designed to optimise complex querying and analysis. Some of the ad hoc queries and interactive analysis that take only seconds to minutes on a data warehouse could take a heavy toll on the transactional systems and drag their performance down. Holding historical data in transactional systems for longer periods of time could also interfere with their performance. Hence, the historical data needs to find its place in the data warehouse.
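A common way to achieve this separation is to pre-aggregate during the load so that interactive queries never touch transaction-level detail. The sketch below, using an in-memory SQLite database with invented tables, builds a monthly summary table once and answers analysis queries from it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2013-01-05", "Books", 120.0),
    ("2013-01-05", "Software", 300.0),
    ("2013-02-10", "Books", 80.0),
    ("2013-02-11", "Software", 410.0),
])

# Build a pre-aggregated summary table once, during the nightly load,
# so that interactive analysis never scans transaction-level rows.
conn.execute("""
    CREATE TABLE monthly_sales AS
    SELECT substr(sale_date, 1, 7) AS month, product, SUM(amount) AS total
    FROM sales
    GROUP BY month, product
""")

# Analysts query the small summary table instead of the detail table.
for row in conn.execute("SELECT * FROM monthly_sales ORDER BY month, product"):
    print(row)
```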

1.3 Basic Elements of Data Warehousing
Basic elements of data warehousing are explained below.


Source System: An operational system of record whose function is to capture the transactions of the business. A source system is often called a "legacy system" in the mainframe environment. The main priorities of the source system are uptime and availability. Queries against source systems are narrow, "account-based" queries that are part of the normal transaction flow and severely restricted in their demands on the legacy system.

Staging Area: A storage area and set of processes that clean, transform, combine, de-duplicate, household, archive and prepare source data for use in the presentation server. In many cases, the primary objects in this area are a set of flat-file tables representing data extracted from the source systems, loading and transformation routines, and a resulting set of tables containing clean data – the Dynamic Data Store. This area does not usually provide query and presentation services.

Presentation Area: The presentation areas are the target physical machines on which the data warehouse data is organised and stored for direct querying by end users, report writers and other applications. The set of presentable data, or Analytical Data Store, normally takes the form of dimensionally modelled tables when stored in a relational database, and cube files when stored in an OLAP database.

End User Data Access Tools: End user data access tools are any clients of the data warehouse. An end user access tool can be as simple as an ad hoc query tool, or as complex as a sophisticated data mining or modelling application.

Metadata: All of the information in the data warehouse environment that is not the actual data itself. This data about data is catalogued, versioned, documented and backed up.

Table 1.3 Data warehousing elements

Planning and requirements
For successful data warehousing, proper planning and management are necessary, and all the necessary requirements must be fulfilled. Bad planning and improper project management practice is the main factor for failures in data warehouse project planning. First of all, make sure that your company really needs a data warehouse to support its business. Then prepare criteria for assessing the value expected from the data warehouse. Decide on the software for the project and establish where the data warehouse will collect its data from. You also need to decide who will be using the data and who will operate the new system.

1.4 Project Planning and Management
Data warehousing comes in all shapes and sizes, which bears a direct relationship to the cost and time involved. The approach to starting a data warehousing project will differ accordingly, and the steps listed below summarise some of the points to consider.

Get professional advice
It will save a huge bundle to get professional advice upfront. Endless meeting time can be saved and the risk of an abandoned data warehousing project can be reduced.

Plan the data
Understand what metrics you want to track in the data warehouse and make sure that there is appropriate data to provide for the analysis. If you want to obtain periodic Key Performance Indicator (KPI) data for shipping logistics, ensure that the appropriate data is piped into the data warehouse.

6/JNU OLE Who will use the Data Warehouse The power data warehouse consumers are business and financial managers. Data warehouses are meant to deliver clear indications on how the business is performing. Plot out the expected users for the data warehouse in an enterprise. See to it that they will have the appropriate reports in a format, which is quickly understandable. Ensure that planning exercises are conducted in advance to accumulate scenarios on how the data warehouse will be used. Always keep in mind that data has to be presented attractively in a format so as business managers will feel comfortable. Text files with lines of numbers will not suffice!

Integration to external applications
Most data warehousing projects sink or swim by their ability to extract data from external applications. Enterprises have a slew of applications, either developed in-house or obtained from a vendor. Conceptually, your data warehouse will act as the heart of the diverse applications running in the enterprise. All important data will flow into or out of the data warehouse.

Technology
The data warehouse will be built on one of the major relational Database Management Systems (DBMS) from vendors such as Oracle, IBM, Microsoft and others. Open source databases, such as MySQL, can also support data warehousing with the right support in place.

The data warehouse is implemented (populated) one subject area at a time, driven by specific business questions to be answered by each implementation cycle. The first and subsequent implementation cycles of the data warehouse are determined during the Business Question Assessment (BQA) stage, which may have been conducted as a separate project. At this stage in the data warehouse process or at the start of this development/implementation project, the first (or next if not first) subject area implementation project is planned.

The business requirements discovered in BQA or an equivalent requirements gathering project and, to a lesser extent, the technical requirements of the Architecture Review and Design stage (or project) are now refined through user interviews and focus sessions. The requirements should be refined to the subject area level and further analysed to yield the detail needed to design and implement a single population project, whether initial or follow-on. The data warehouse project team is expanded to include the members needed to construct and deploy the Warehouse, and a detailed work plan for the design and implementation of the iteration project is developed and presented to the customer organisation for approval.

The following diagram illustrates the sequence in which steps in the data warehouse iteration project planning stage must be conducted.

The steps are: define detailed business and technical requirements; refine the data model and source data inventory; plan the iteration development project; and obtain iteration project approval and funding.

Fig. 1.2 Steps in data warehouse iteration project planning stage (Source: http://www.gantthead.com/content/processes/9265.cfm#Description)


Collecting the requirements
The informational requirements of the organisation need to be collected within a time box. The following figure shows the typical means by which those informational requirements are identified and collected.

Fig. 1.3 Means of identifying required information (Source: http://inmoncif.com/inmoncif-old/www/library/whiteprs/ttbuild.pdf)

Typically, informational requirements are collected by looking at:

Reports
Existing reports can usually be gathered quickly and inexpensively. In most cases, the information displayed on these reports is easily discerned. However, old reports represent yesterday's requirements, and the underlying calculation of the information may not be obvious at all.

Spreadsheets
Spreadsheets can easily be gathered by asking the DSS analyst community. Like standard reports, the information on spreadsheets can be discerned easily. The problems with spreadsheets are:
• They are very fluid; for example, important spreadsheets may have been created several months ago and may not be available now.
• They change with no documentation.
• They may not be easy to gather unless the analyst who created them wants them to be gathered.
• Their structure and usage of data may be obtuse.

Other existing analysis
Through EIS and other channels, there is usually quite a bit of other useful information analysis that has been created by the organisation. This information is usually unstructured and very informal, although in many cases it is still valuable.

Live interviews
Typically, through interviews or JAD sessions, the end users can describe the informational needs of the organisation. Unfortunately, JAD sessions require an enormous amount of energy to conduct and assimilate. Furthermore, the effectiveness of JAD sessions depends in no small part on the imagination and spontaneity of the end users participating in the session.

In any case, gathering the obvious and easily accessed informational needs of the organisation should be done and should be factored into the data warehouse data model prior to the development of the first iteration of the data warehouse.

1.5 Architecture and Infrastructure
The architecture is the logical and physical foundation on which the data warehouse will be built. The architecture review and design stage, as the name implies, is both a requirements analysis and a gap analysis activity. It is important to assess what pieces of the architecture already exist in an organisation (and in what form) and what pieces are missing and are needed to build the complete data warehouse architecture.

During the Architecture Review and Design stage, the logical data warehouse architecture is developed. The logical architecture is a configuration map of the necessary data stores that make up the warehouse; it includes a central Enterprise Data Store, an optional Operational Data Store, one or more (optional) individual business area Data Marts, and one or more Metadata stores. In the metadata store(s) are two different kinds of metadata that catalogue reference information about the primary data.

Once the logical configuration is defined, the Data, Application, Technical and Support Architectures are designed to physically implement it. Requirements of these four architectures are carefully analysed, so that the data warehouse can be optimised to serve the users. Gap analysis is conducted to determine, which components of each architecture already exist in the organisation and can be reused, and which components must be developed (or purchased) and configured for the data warehouse.

The data architecture organises the sources and stores of business information and defines the quality and management standards for data and metadata.

The application architecture is the software framework that guides the overall implementation of business functionality within the warehouse environment; it controls the movement of data from source to user, including the functions of data extraction, data cleansing, data transformation, data loading, data refresh, and data access (reporting, querying).

The technical architecture provides the underlying computing infrastructure that enables the data and application architectures. It includes platform/server, network, communications and connectivity hardware/software/middleware, DBMS, the client/server 2-tier vs. 3-tier approach, and end-user workstation hardware/software. Technical architecture design must address the requirements of scalability, capacity and volume handling (including sizing and partitioning of tables), performance, availability, stability, chargeback, and security.

The support architecture includes the software components (for example, tools and structures for backup/recovery, disaster recovery, performance monitoring, reliability/stability compliance reporting, data archiving, and version control/configuration management) and the organisational functions necessary to effectively manage the technology investment.

Architecture review and design applies to the long-term strategy for development and refinement of the overall data warehouse, and is not conducted merely for a single iteration. This stage (or project) develops the blueprint of an encompassing data and technical structure, software application configuration, and organisational support structure for the warehouse. It forms a foundation that drives the iterative Detail Design activities. Detail Design tells you what to do; Architecture Review and Design tells you what pieces you need in order to do it.

The architecture review and design stage can be conducted as a separate project that runs mostly in parallel with the business question assessment stage, because the technical, data, application and support infrastructure that enables and supports the storage and access of information is generally independent of the business requirements that determine which data is needed to drive the warehouse. However, the data architecture depends on receiving input from certain BQA or alternative business requirements analysis activities (such as data source system identification and data modelling); therefore, the BQA stage or similar business requirements identification activities must conclude before the architecture stage or project can conclude.

The architecture will be developed based on the organisation’s long-term data warehouse strategy, so that each future iteration of the warehouse will be provided for and will fit within the overall data warehouse architecture.

Data warehouses can be architected in many different ways, depending on the specific needs of a business. The model shown below is the “hub-and-spokes” Data Warehousing architecture that is popular in many organisations.


In short, data is moved from databases used in operational systems into a data warehouse staging area, then into a data warehouse and finally into a set of conformed data marts. Data is copied from one database to another using a technology called ETL (Extract, Transform, Load).

Fig. 1.4 Typical data warehousing environment (Source: http://data-warehouses.net/architecture/operational.html)

The description of the above diagram is as follows:

Operational applications
The principal reason why businesses need to create data warehouses is that their corporate data assets are fragmented across multiple, disparate application systems, running on different technical platforms in different physical locations. This situation does not enable good decision making.

When data redundancy exists in multiple databases, data quality often deteriorates. Poor business intelligence results in poor strategic and tactical decision making.

Individual business units within an enterprise are designated as "owners" of operational applications and databases. These "organisational silos" sometimes do not understand the strategic importance of having well integrated, non-redundant corporate data. Consequently, they frequently purchase or build operational systems that do not integrate well with existing systems in the business.

Data management issues have deteriorated in recent years as businesses deployed a parallel set of e-business and ecommerce applications that do not integrate with existing “full service” operational applications.

Operational databases are normally “relational” - not “dimensional”. They are designed for operational, data entry purposes and are not well suited for online queries and analytics.

Due to globalisation, mergers and outsourcing trends, the need to integrate operational data from external organisations has arisen. The sharing of customer and sales data among business partners can, for example, increase business intelligence for all business partners.

The challenge for data warehousing is to be able to quickly consolidate, cleanse and integrate data from multiple, disparate databases that run on different technical platforms in different geographical locations.

Extraction, transformation and loading
ETL technology (shown in Fig. 1.4 with arrows) is an important component of the data warehousing architecture. It is used to copy data from operational applications to the data warehouse staging area, from the DW staging area into the data warehouse, and finally from the data warehouse into a set of conformed data marts that are accessible by decision makers.

The ETL software extracts data, transforms values of inconsistent data, cleanses “bad” data, filters data and loads data into a target database. The scheduling of ETL jobs is critical. Should there be a failure in one ETL job, the remaining ETL jobs must respond appropriately.
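The following sketch shows the shape of such an ETL job chain in Python. The extract, transform and load steps are stubs with hypothetical data; the point is the flow (extract, cleanse, load) and the fact that a failure in one job does not stop the remaining jobs.

```python
# A minimal sketch of an ETL job chain, assuming hypothetical helper functions;
# a real warehouse would use a dedicated ETL tool or scheduler.

def extract(source):
    """Pull raw rows from an operational source (stubbed here)."""
    return [{"source": source, "amount": "100"}, {"source": source, "amount": "bad"}]

def transform(rows):
    """Convert values, cleanse bad records and filter what cannot be fixed."""
    clean = []
    for row in rows:
        try:
            clean.append({"source": row["source"], "amount": float(row["amount"])})
        except ValueError:
            pass  # a real job would write the reject to an error table
    return clean

def load(rows, target):
    """Write the cleansed rows into the target store (stubbed as a list)."""
    target.extend(rows)

def run_jobs(sources):
    warehouse = []
    for source in sources:
        try:
            load(transform(extract(source)), warehouse)
        except Exception as exc:
            # If one job fails, the remaining jobs still run; the failure is
            # recorded so the load can be repeated for that source later.
            print(f"ETL job for {source} failed: {exc}")
    return warehouse

print(run_jobs(["customer_db", "sales_db", "products_db"]))
```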

Data warehousing staging area
The data warehouse staging area is a temporary location where data from source systems is copied. A staging area is mainly required in a data warehousing architecture for timing reasons. In short, all required data must be available before data can be integrated into the data warehouse.

Due to varying business cycles, data processing cycles, hardware and network resource limitations and geographical factors, it is not feasible to extract all the data from all operational databases at exactly the same time.

For example, it might be reasonable to extract sales data on a daily basis; however, daily extracts might not be suitable for financial data that requires a month-end reconciliation process. Similarly, it might be feasible to extract “customer” data from a database in Singapore at noon eastern standard time, but this would not be feasible for “customer” data in a Chicago database.

Data in the data warehouse can be either persistent (remains around for a long period) or transient (only remains around temporarily).

Not all businesses require a data warehouse staging area. For many businesses, it is feasible to use ETL to copy data directly from operational databases into the data warehouse.

Data marts
ETL (Extract, Transform, Load) jobs extract data from the data warehouse and populate one or more data marts for use by groups of decision makers in the organisation. The data marts can be dimensional (Star Schemas) or relational, depending on how the information is to be used and what "front end" data warehousing tools will be used to present the information.

Each data mart can contain different combinations of tables, columns and rows from the enterprise data warehouse.

For example, a business unit or user group that does not need much historical data might only need transactions from the current calendar year in the database. The personnel department might need to see all details about employees, whereas data such as "salary" or "home address" might not be appropriate for a data mart that focuses on Sales. Some data marts might need to be refreshed from the data warehouse daily, whereas other user groups might need refreshes only monthly.
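Deriving a data mart from the warehouse is essentially a projection and a filter. The sketch below, with invented rows and column names, builds a Sales mart that keeps only current-year rows and drops columns such as salary that this user group should not see.

```python
# Hypothetical warehouse rows; a Sales data mart keeps only current-year
# transactions and drops columns (such as salary) that Sales should not see.
warehouse_rows = [
    {"employee": "A", "salary": 50000, "region": "North", "year": 2012, "sales": 900.0},
    {"employee": "B", "salary": 62000, "region": "South", "year": 2013, "sales": 1200.0},
    {"employee": "C", "salary": 58000, "region": "North", "year": 2013, "sales": 700.0},
]

SALES_MART_COLUMNS = ["employee", "region", "year", "sales"]
CURRENT_YEAR = 2013

sales_mart = [
    {col: row[col] for col in SALES_MART_COLUMNS}
    for row in warehouse_rows
    if row["year"] == CURRENT_YEAR
]
print(sales_mart)
```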

1.5.1 Infrastructure
A data warehouse is a 'business infrastructure'. In practice, it does not do anything on its own, but provides sanitised, consistent and integrated information for a host of applications and end-user tools. Therefore, the stability, availability and response time of this platform are critical. Just like a foundation pillar, its strength is core to your information management success.

Various factors to be considered for data warehouse infrastructure are as follows:


Data Warehouse Data Size
Data warehouses grow fast in terms of size. This is not only the increment to the data as per your current design. A data warehouse will have frequent additions of new dimensions, attributes and measures. With each such addition, the data could take a quantum jump, as you may bring in the entire historical data related to that additional dimensional model element. Therefore, as you estimate your data size, be on the conservative side. A rough sizing calculation is sketched after this list.
• Data Dynamics for Data Warehouse: The volume and frequency of data increments determine the processing speed and memory of the hardware platform. The increment of data is typically on a daily basis. However, the level of increment can differ depending upon which data you are pulling in. Most of the time, you may pull a huge amount of data from the source system into the staging area, but load a much smaller summary data set into the data warehouse.
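The back-of-the-envelope calculation below illustrates conservative sizing. All figures (rows per day, row width, index overhead, growth buffer) are assumptions chosen for the example, not recommendations from the text.

```python
# Hypothetical sizing figures for a daily-loaded fact table.
rows_per_day = 200_000
bytes_per_row = 120                      # assumed average width of a fact row
years_of_history = 3

raw_bytes = rows_per_day * 365 * years_of_history * bytes_per_row
with_indexes = raw_bytes * 2.0           # assume indexes and aggregates roughly double it
with_growth = with_indexes * 1.5         # headroom for new dimensions and attributes

print(f"raw fact data      : {raw_bytes / 1e9:.1f} GB")
print(f"plus indexes       : {with_indexes / 1e9:.1f} GB")
print(f"plus growth buffer : {with_growth / 1e9:.1f} GB")
```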

Number of users of Data Warehouse
The number of users is essentially the number of concurrent logins on the data warehouse platform. Estimating the number of users of a data warehouse has the following complications:
• Sometimes the "user" is an end-user tool, which may mask the actual number of users. For example, an enterprise reporting server can access the data warehouse as a few users to generate all the enterprise reports; after that, the actual users access the database and report repository of the enterprise reporting system and not the data warehouse. Similarly, you might be using an analytics system that creates its own local cube from the data warehouse; the actual users may be accessing that cube without logging into the data warehouse. Sometimes the users could be referring to the cache of the data warehouse distributed database and not to the main data warehouse.
• There is no fixed formula for calculating (and then linking) the number of users for the purpose of estimating the infrastructure needed. We assume that the data warehouse will be able to support a large number of simultaneous login threads.

Fig. 1.5 Overview of data warehouse infrastructure (Source: http://www.dwreview.com/Articles/Metadata.html)

1.5.2 Metadata
Metadata is data about data. Metadata has been around as long as there have been programs and data for the programs to operate on. The following figure shows metadata in a simple form.

Fig. 1.6 Data warehouse metadata (Source: http://t1.gstatic.com)

While metadata is not new, the role of metadata and its importance in the face of the data warehouse certainly is new. For years, the information technology professional has worked in the same environment as metadata, but in many ways has paid little attention to metadata. The information professional has spent a life dedicated to process and functional analysis, user requirements, maintenance, architectures, and the like. The role of metadata has been passive at best in this situation.

However, metadata plays a very different role in data warehouse. Relegating metadata to a backwater, passive role in the data warehouse environment is to defeat the purpose of data warehouse. Metadata plays a very active and important part in the data warehouse environment. The reason why metadata plays such an important and active role in the data warehouse environment is apparent when contrasting the operational environment to the data warehouse environment in so far as the user community is concerned.

Mapping
A basic part of the data warehouse environment is the mapping from the operational environment into the data warehouse. The mapping includes a wide variety of facets, including, but not limited to:
• mapping from one attribute to another
• conversions
• changes in naming conventions
• changes in physical characteristics of data
• filtering of data

The following figure shows the storing of the mapping in metadata for the data warehouse.


Fig. 1.7 Importance of mapping between two environments (Source: http://www.inmoncif.com/registration/whitepapers/ttmeta-1.pdf)

It may not be obvious why mapping information is so important in the data warehouse environment. Consider the vice president of marketing who has just asked for a new report. The DSS analyst turns to the data warehouse for the data for the report. Upon inspection, the vice president proclaims the report to be fiction. The credibility of the DSS analyst goes down until the DSS analyst can prove the data in the report to be valid. The DSS analyst first looks to the validity of the data in the warehouse. If the data warehouse data has not been reported properly, then the reports are adjusted. However, if the reports have been made properly from the data warehouse, the DSS analyst is in the position of having to go back to the operational source to salvage credibility. At this point, if the mapping data has been carefully stored, then the DSS analyst can quickly and gracefully go to the operational source. However, if the mapping has not been stored or has not been stored properly, then the DSS analyst has a difficult time defending his/her conclusions to management. The metadata store for the data warehouse is a natural place for the storing of mapping information.

1.5.3 Metadata Components

Basic components
The basic components of the data warehouse metadata store include the tables that are contained in the warehouse, the keys of those tables, and the attributes. The following figure shows these components of the data warehouse.

Fig. 1.8 Simplest component of metadata (Source: http://www.inmoncif.com/registration/whitepapers/ttmeta-1.pdf)
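A minimal sketch of these basic components, kept as plain Python structures with invented table names, might look like this; a real metadata store would of course live in a database or a dedicated repository.

```python
# Hypothetical catalogue of tables, their keys and their attributes.
metadata_store = {
    "tables": {
        "sales_fact": {
            "keys": ["product_key", "location_key", "time_key"],
            "attributes": ["amount", "quantity"],
        },
        "product_dim": {
            "keys": ["product_key"],
            "attributes": ["product_name", "category"],
        },
    }
}

def describe(table):
    """Return a one-line description of a table from the metadata store."""
    entry = metadata_store["tables"][table]
    return f"{table}: keys={entry['keys']}, attributes={entry['attributes']}"

print(describe("sales_fact"))
```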

Mapping
The typical contents of mapping metadata that are stored in the data warehouse metadata store are:
• identification of source field(s)
• simple attribute-to-attribute mapping
• attribute conversions
• physical characteristic conversions
• encoding/reference table conversions
• naming changes
• key changes
• defaults
• logic to choose from multiple sources
• algorithmic changes, and so forth
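The sketch below shows what two such mapping entries might look like and how a rule is applied to a source value. The source and target names, the encoding table and the defaults are all hypothetical; they simply illustrate the kinds of conversions listed above.

```python
# Hypothetical mapping entries: each records where a warehouse attribute
# comes from and how it is derived from the operational source.
mapping_metadata = [
    {
        "target": "customer_dim.gender",
        "sources": ["CRM.CUST_MASTER.SEX_CD"],
        "conversion": "encode",                  # reference-table conversion
        "encoding": {"M": "Male", "F": "Female"},
        "default": "Unknown",
    },
    {
        "target": "sales_fact.amount",
        "sources": ["ORDERS.ORDER_LINE.AMT"],
        "conversion": "numeric",                 # physical characteristic conversion
        "default": 0.0,
    },
]

def apply_mapping(entry, source_value):
    """Apply a single mapping rule to one source value."""
    if entry["conversion"] == "encode":
        return entry["encoding"].get(source_value, entry["default"])
    if entry["conversion"] == "numeric":
        try:
            return float(source_value)
        except (TypeError, ValueError):
            return entry["default"]
    return source_value

print(apply_mapping(mapping_metadata[0], "F"))     # prints Female
print(apply_mapping(mapping_metadata[1], "12.5"))  # prints 12.5
```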

Fig. 1.9 Storing mapping information in the data warehouse (Source: http://www.inmoncif.com/registration/whitepapers/ttmeta-1.pdf)

Extract History
The actual history of extracts and transformations of data coming from the operational environment into the data warehouse environment is another component that belongs in the data warehouse metadata store.

Fig. 1.10 Keeping track of when extracts have been run (Source: http://www.inmoncif.com/registration/whitepapers/ttmeta-1.pdf)

The extract history simply tells the DSS analyst when data entered the data warehouse. The DSS analyst has many uses for this type of information. One occasion is when the DSS analyst wants to know the last time the data in the warehouse was refreshed. Another is when the DSS analyst wants to do what-if processing and the assertions of the analysis have changed: the analyst needs to know whether the results obtained for one analysis differ from those of an earlier analysis because of a change in the assertions or a change in the data. There are many cases where the DSS analyst needs the precise history of when insertions have been made to the data warehouse.
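Recording the extract history can be as simple as appending one record per load run. The sketch below, using an in-memory list and invented source names, shows how the "last refreshed" question would then be answered from metadata rather than by inspecting the data itself.

```python
from datetime import datetime, timezone

# Hypothetical extract-history log; in a real warehouse this would live
# in the metadata store, not in memory.
extract_history = []

def record_extract(source, rows_loaded):
    """Append one history record each time an extract is loaded."""
    extract_history.append({
        "source": source,
        "rows_loaded": rows_loaded,
        "loaded_at": datetime.now(timezone.utc).isoformat(),
    })

def last_refresh(source):
    """Tell the DSS analyst when data from a source last entered the warehouse."""
    runs = [run for run in extract_history if run["source"] == source]
    return runs[-1]["loaded_at"] if runs else None

record_extract("orders_system", 15000)
print(last_refresh("orders_system"))
```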


Miscellaneous
Alias information is attribute and key information that allows for alternative names. Alternative names often make a data warehouse environment much more "user friendly". In some cases, technicians have influenced naming conventions that cause data warehouse names to be incomprehensible.

Fig. 1.11 Other useful metadata (Source: http://www.inmoncif.com/registration/whitepapers/ttmeta-1.pdf)

In other cases, one department's names for data have been entered into the warehouse, and another department would like to have its names for the data imposed. Aliases are a good way to resolve these issues. Another useful data warehouse metadata component is status. In some cases, a data warehouse table is undergoing design; in other cases, the table is inactive or may contain misleading data. The existence of a status field is a good way to record these differences. Volumetrics are measurements about the data in the warehouse. Typical volumetric information might include:
• the number of rows currently in the table
• the growth rate of the table
• the statistical profile of the table
• the usage characteristics of the table
• the indexing for the table and its structure
• the byte specifications for the table.

Volumetric information is useful for the DSS analyst planning an efficient usage of the data warehouse. It is much more effective to consult the volumetrics before submitting a query that will use unknown resources than it is to simply submit the query and hope for the best.
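As an illustration, the sketch below consults hypothetical volumetric statistics to pick the cheapest table that can answer a query, preferring a small summary table over a very large fact table. The numbers and table names are invented.

```python
# Hypothetical volumetric statistics collected into the metadata store.
volumetrics = {
    "sales_fact": {"row_count": 250_000_000, "growth_rate_per_month": 0.04,
                   "indexed_columns": ["time_key", "product_key"]},
    "monthly_summary": {"row_count": 36_000, "growth_rate_per_month": 0.01,
                        "indexed_columns": ["month"]},
}

def choose_table(candidates, max_rows=1_000_000):
    """Prefer the smallest candidate table that stays under the row budget."""
    suitable = [t for t in candidates if volumetrics[t]["row_count"] <= max_rows]
    return min(suitable, key=lambda t: volumetrics[t]["row_count"]) if suitable else None

print(choose_table(["sales_fact", "monthly_summary"]))   # prints monthly_summary
```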

Aging/purge criteria is also an important component of data warehouse metadata. Looking into the metadata store for a definition of the life cycle of data warehouse data is much more efficient than trying to divine the life cycle by examining the data inside the warehouse.

Summary
• Data warehousing is combining data from various and usually diverse sources into one comprehensive and easily operated database.
• Common accessing systems of data warehousing include queries, analysis and reporting.
• Data warehousing is commonly used by companies to analyse trends over time.
• Data warehousing is an important part, and in most cases the foundation, of business intelligence architecture.
• A data warehouse helps in combining scattered and unmanageable data into a particular format, which can be easily accessible.
• The data warehouse is designed specifically to support querying, reporting and analysis tasks.
• Knowledge discovery and data mining (KDD) is the automatic extraction of non-obvious hidden knowledge from large volumes of data.
• For successful data warehousing, proper planning and management are necessary, and all necessary requirements must be fulfilled. Bad planning and improper project management practice is the main factor for failures in data warehouse project planning.
• Data warehousing comes in all shapes and sizes, which bears a direct relationship to cost and time involved.
• The architecture is the logical and physical foundation on which the data warehouse will be built.
• The data warehouse staging area is a temporary location where data from source systems is copied.
• Metadata is data about data. Metadata has been around as long as there have been programs and data for the programs to operate on.
• A basic part of the data warehouse environment is the mapping from the operational environment into the data warehouse.

References
• Mailvaganam, H., 2007. Data Warehouse Project Management [Online] Available at: . [Accessed 8 September 2011].
• Hadley, L., 2002. Developing a Data Warehouse Architecture [Online] Available at: . [Accessed 8 September 2011].
• Humphries, M., Hawkins, M. W. and Dy, M. C., 1999. Data Warehousing: Architecture and Implementation, Prentice Hall Professional.
• Ponniah, P., 2001. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, Wiley-Interscience Publication.
• Kumar, A., 2008. Data Warehouse Layered Architecture 1 [Video Online] Available at: <http://www.youtube.com/watch?v=epNENgd40T4>. [Accessed 11 September 2011].
• Intricity101, 2011. What is OLAP? [Video Online] Available at: . [Accessed 12 September 2011].

Recommended Reading
• Parida, R., 2006. Principles & Implementation of Data Warehousing, Firewall Media.
• Khan, A., 2003. Data Warehousing 101: Concepts and Implementation, iUniverse.
• Jarke, M., 2003. Fundamentals of Data Warehouses, 2nd ed., Springer.


Self Assessment
1. ______ is combining data from diverse sources into one comprehensive and easily operated database.
a. Data warehousing
b. Data mining
c. Mapping
d. Metadata

2. ______ is commonly used by companies to analyse trends over time.
a. Data mining
b. Architecture
c. Planning
d. Data warehousing

3. ______ is a technique used by organisations to come up with facts, trends or relationships that can help them make effective decisions.
a. Mapping
b. Operation analysis
c. Decision support system
d. Data integration

4. Which statement is false?
a. Knowledge discovery and data mining (KDD) is the automatic extraction of non-obvious hidden knowledge from large volumes of data.
b. The data warehouse enables an Executive Information System (EIS).
c. The data in the data warehouse is already prepared and structured to support this kind of analysis.
d. The metadata is designed specifically to support querying, reporting and analysis tasks.

5. Bad planning and improper ______ practice is the main factor for failures in data warehouse project planning.
a. project management
b. operation management
c. business management
d. marketing management

6. ______ comes in all shapes and sizes, which bear a direct relationship to the cost and time involved.
a. Metadata
b. Data mining
c. Mapping
d. Data warehousing

7. The ______ simply tells the DSS analyst when data entered the data warehouse.
a. mapping
b. components
c. extract history
d. miscellaneous

8. Which of the following information is useful for the DSS analyst planning an efficient usage of the data warehouse?
a. Matrices
b. Volumetric
c. Algebraic
d. Statistical

9. Which of the following criteria is an important component of data warehouse metadata?
a. Aging/purge
b. Mapping
c. Data mart
d. Status

10. ______ in the data warehouse can be either persistent or transient.
a. Metadata
b. Mapping
c. Data
d. Operational database


Chapter II Data Design and Data Representation

Aim

The aim of this chapter is to:

• introduce the concepts of data design and dimensional modelling for the data warehouse

• analyse the star schema and its keys

• explore data extraction, transformation and loading

Objectives

The objectives of this chapter are to:

• explicate the design decisions involved in building the data warehouse

• describe the major data extraction techniques

• elucidate the major data transformation types

Learning outcome

At the end of this chapter, you will be able to:

• enlist the basic tasks in data transformation

• comprehend the approaches to data loading

• understand the characteristics of data quality

2.1 Introduction
Data design consists of putting together the data structures. A group of data elements forms a data structure. Logical data design includes determining the various data elements that are needed and combining the data elements into data structures. Logical data design also includes establishing the relationships among the data structures.

Observe in the following figure how the phases start with requirements gathering. The results of the requirements gathering phase are documented in detail in the requirements definition document. An essential component of this document is the set of information package diagrams. Remember that these are information matrices showing the metrics, business dimensions, and the hierarchies within individual business dimensions. The information package diagrams form the basis for the logical data design for the data warehouse. The data design process results in a dimensional data model.

Fig. 2.1 Data design: requirements gathering produces the requirements definition document and information packages, which drive the data design phase and result in the dimensional model

2.2 Design Decision
Following are the design decisions you have to make:
• Choosing the process: selecting the subjects from the information packages for the first set of logical structures to be designed.
• Choosing the grain: determining the level of detail for the data in the data structures.
• Identifying and conforming the dimensions: choosing the business dimensions (such as product, market, time and so on) to be included in the first set of structures and making sure that each particular data element in every business dimension is conformed to one another.
• Choosing the facts: selecting the metrics or units of measurement (such as product sale units, dollar sales, dollar revenue and so on) to be included in the first set of structures.
• Choosing the duration of the database: determining how far back in time you should go for historical data.

2.3 Use of CASE Tools
There are many CASE tools available for data modelling, which you can use to create the logical schema and the physical schema for specific target database management systems (DBMSs).

21/JNU OLE Data Mining

You can use a CASE tool to define the tables, the attributes, and the relationships. You can assign the primary keys and indicate the foreign keys. You can form the entity-relationship diagrams. All of this is done very easily using graphical user interfaces and powerful drag-and-drop facilities. After creating an initial model, you may add fields, delete fields, change field characteristics, create new relationships, and make any number of revisions with utmost ease.

Another very useful function found in the case tools is the ability to forward-engineer the model and generate the schema for the target database system you need to work with. Forward-engineering is easily done with these case tools.

• OLTP systems capture details of events or transactions
• OLTP systems focus on individual events
• An OLTP system is a window into micro-level transactions, giving the picture at the detail level necessary to run the business
• Suitable only for questions at the transaction level
• Data consistency, non-redundancy and efficiency are critical
Entity-relationship modelling removes data redundancy, ensures data consistency and expresses microscopic relationships.

Fig. 2.2 E-R modelling for OLTP systems

• A DW is meant to answer questions on an overall process
• The DW focus is on how managers view the business
• The DW reveals business trends
• Information is centred around a business process
• Answers show how the business measures the process
• The measures are to be studied in many ways along several business dimensions
Dimensional modelling captures critical measures, views them along dimensions, and is intuitive to business users.

Fig. 2.3 Dimensional modelling for data warehousing

For modelling the data warehouse, one needs to know the dimensional modelling technique. Most of the existing vendors have expanded their modelling case tools to include dimensional modelling. You can create fact tables, dimension tables, and establish the relationships between each dimension table and the fact table. The result is a STAR schema for your model. Again, you can forward-engineer the dimensional STAR model into a relational schema for your chosen database management system.

2.4 Star Schema
Creating the STAR schema is the fundamental data design technique for the data warehouse. It is necessary to gain a good grasp of this technique.

2.4.1 Review of a Simple STAR Schema
We will take a simple STAR schema designed for order analysis. Assume this to be the schema for a manufacturing company and that the marketing department is interested in determining how they are doing with the orders received by the company. The following figure shows this simple STAR schema. It consists of the orders fact table, shown in the middle of the schema diagram. Surrounding the fact table are the four dimension tables of customer, salesperson, order date, and product. Let us begin to examine this STAR schema. Look at the structure from the point of view of the marketing department.

Fig. 2.4 Simple STAR schema for orders analysis: the order measures fact table (order dollars, cost, margin dollars, quantity sold) surrounded by the customer (customer name, customer code, billing address, shipping address), product (product name, SKU, brand), order date (date, month, quarter, year) and salesperson (salesperson name, territory name, region name) dimension tables

The users in this department will analyse the orders using dollar amounts, cost, profit margin, and sold quantity. This information is found in the fact table of the structure. The users will analyse these measurements by breaking down the numbers in combinations by customer, salesperson, date, and product. All these dimensions along which the users will analyse are found in the structure. The STAR schema structure is a structure that can be easily understood by the users and with which they can comfortably work. The structure mirrors how the users normally view their critical measures along with their business dimensions.

When you look at the order dollars, the STAR schema structure intuitively answers the questions of what, when, by whom, and to whom. From the STAR schema, the users can easily visualise the answers to these questions: For a given amount of dollars, what was the product sold? Who was the customer? Which salesperson brought the order? When was the order placed?

When a query is made against the data warehouse, the results of the query are produced by combining or joining one or more dimension tables with the fact table. The joins are between the fact table and individual dimension tables. The relationship of a particular row in the fact table is with the rows in each dimension table. These individual relationships are clearly shown as the spikes of the STAR schema.

Take a simple query against the STAR schema. Let us say that the marketing department wants the quantity sold and order dollars for product bigpart-1, relating to customers in the state of Maine, obtained by salesperson Jane Doe, during the month of June. Following figure shows how this query is formulated from the STAR schema. Constraints and filters for queries are easily understood by looking at the STAR schema.
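The query described here translates directly into joins between the fact table and the constrained dimension tables. The sketch below, using Python's built-in sqlite3 module, is only an illustration: the table and column names follow the simplified schema of the figures above and are assumptions, not a prescribed design.

import sqlite3

# Minimal illustrative STAR schema: one fact table joined to four dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product     (product_key INTEGER PRIMARY KEY, product_name TEXT, sku TEXT, brand TEXT);
CREATE TABLE customer    (customer_key INTEGER PRIMARY KEY, customer_name TEXT, state TEXT);
CREATE TABLE salesperson (salesperson_key INTEGER PRIMARY KEY, salesperson_name TEXT, region_name TEXT);
CREATE TABLE order_date  (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE orders_fact (product_key INTEGER, customer_key INTEGER,
                          salesperson_key INTEGER, date_key INTEGER,
                          order_dollars REAL, quantity_sold INTEGER);
""")

# The query of Fig. 2.5: quantity sold and order dollars for product bigpart-1,
# customers in the state of Maine, salesperson Jane Doe, during the month of June.
query = """
SELECT SUM(f.quantity_sold), SUM(f.order_dollars)
FROM   orders_fact f
JOIN   product     p ON f.product_key     = p.product_key
JOIN   customer    c ON f.customer_key    = c.customer_key
JOIN   salesperson s ON f.salesperson_key = s.salesperson_key
JOIN   order_date  d ON f.date_key        = d.date_key
WHERE  p.product_name     = 'bigpart-1'
AND    c.state            = 'Maine'
AND    s.salesperson_name = 'Jane Doe'
AND    d.month            = 'June';
"""
print(conn.execute(query).fetchone())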


Fig. 2.5 Understanding a query from the STAR schema: the dimension tables are constrained by product name = bigpart-1, state = Maine, salesperson name = Jane Doe and month = June

2.4.2 Star Schema Keys
Following figure illustrates how the keys are formed for the dimension and fact tables.

Fact table: compound primary key, one segment for each dimension (STORE KEY, PRODUCT KEY, TIME KEY).
Dimension table: generated primary key.

Fig. 2.6 STAR schema keys

Primary Keys
Each row in a dimension table is identified by a unique value of an attribute designated as the primary key of the dimension. In a product dimension table, the primary key identifies each product uniquely. In the customer dimension table, the customer number identifies each customer uniquely. Similarly, in the sales representative dimension table, the social security number of the sales representative identifies each sales representative.

Surrogate Keys
There are two general principles to be applied when choosing primary keys for dimension tables. The first principle is derived from the problems caused when a key carries built-in meanings, for example when a product begins to be stored in a different warehouse. In such operational systems, some positions in the product key indicate the warehouse and some other positions in the key indicate the product category. These are built-in meanings in the key. The first principle to follow is: avoid built-in meanings in the primary key of the dimension tables.

The second principle arises because production system keys may be retired or reassigned, while the data of a retired customer may still be used for aggregations and comparisons by city and state. Therefore, the second principle is: do not use production system keys as primary keys for dimension tables. Use surrogate keys instead. The surrogate keys are simply system-generated sequence numbers. They do not have any built-in meanings. Of course, the surrogate keys will be mapped to the production system keys; nevertheless, they are different. The general practice is to keep the operational system keys as additional attributes in the dimension tables. Refer back to Figure 2.6: the STORE KEY is the surrogate primary key for the store dimension table. The operational system primary key for the store reference table may be kept as just another non-key attribute in the store dimension table.
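A minimal sketch of surrogate key assignment follows. The idea is simply that the warehouse generates its own sequence numbers and keeps the production (operational) key as an ordinary attribute; the function name, the sample operational key and the field names are illustrative assumptions.

import itertools

# System-generated sequence of surrogate keys with no built-in meaning.
_next_key = itertools.count(start=1)

# Mapping from production-system keys to warehouse surrogate keys.
production_to_surrogate = {}

def surrogate_key_for(production_key):
    """Return the surrogate key for a production key, assigning one if new."""
    if production_key not in production_to_surrogate:
        production_to_surrogate[production_key] = next(_next_key)
    return production_to_surrogate[production_key]

# The operational key (here a made-up value with warehouse and category encoded
# in it) is kept only as a non-key attribute in the dimension row.
dimension_row = {
    "product_key": surrogate_key_for("WH2-CAT7-00311"),   # surrogate primary key
    "operational_product_key": "WH2-CAT7-00311",
    "product_name": "bigpart-1",
}
print(dimension_row)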

Foreign Keys Each dimension table is in a one-to-many relationship with the central fact table. So the primary key of each dimension table must be a foreign key in the fact table. If there are four dimension tables of product, date, customer, and sales representative, then the primary key of each of these four tables must be present in the orders fact table as foreign keys.


2.5 Dimensional Modelling
Dimensional modelling gets its name from the business dimensions we need to incorporate into the logical data model. It is a logical design technique to structure the business dimensions and the metrics that are analysed along these dimensions. This modelling technique is intuitive for that purpose. The model has also proved to provide high performance for queries and analysis.

The multidimensional information package diagram discussed earlier provides the foundation for the dimensional model. The dimensional model therefore consists of the specific data structures needed to represent the business dimensions. These data structures also contain the metrics or facts.

Dimensional modelling is a technique for conceptualising and visualising data models as a set of measures that are described by common aspects of the business. Dimensional modelling has two basic concepts.

Important facts of dimensional modelling are explained below:
• A fact is a collection of related data items, consisting of measures.
• A fact is a focus of interest for the decision-making process.
• Measures are continuously valued attributes that describe facts.
• A fact is a business measure.

A dimension is the parameter over which analysis of facts is performed; it is the parameter that gives meaning to a measure. For example, the number of customers is a fact, and we may perform analysis of it over the time dimension.

2.5.1 E-R Modelling versus Dimensional Modelling
We are familiar with data modelling for operational or OLTP systems. We adopt the Entity-Relationship (E-R) modelling technique to create the data models for these systems. We have so far discussed the basics of the dimensional model and find that this model is most suitable for modelling the data for the data warehouse. It is necessary to summarise the characteristics of the data warehouse information and review how dimensional modelling is suitable for this purpose.

2.6 Data Extraction
Two major factors differentiate the data extraction for a new operational system from the data extraction for a data warehouse. First, for a data warehouse, it is necessary to extract data from many disparate sources. Next, for a data warehouse, it is necessary to extract data on the changes for ongoing incremental loads as well as for a one-time initial full load. For operational systems, all that is needed is one-time extractions and data conversions.

These two factors increase the complexity of data extraction for a data warehouse and, therefore, warrant the use of third-party data extraction tools in addition to in-house programs or scripts. Third-party tools are generally more expensive than in-house programs, but they record their own metadata. On the other hand, in-house programs increase maintenance costs and are hard to keep current as the source systems change.

If the company is in an industry where frequent changes to business conditions are the norm, you may want to minimise the use of in-house programs. Third-party tools usually provide built-in flexibility; you simply change the input parameters of the third-party tool in use. Effective data extraction is a key to the success of the data warehouse. Therefore, pay special attention to this issue and formulate a data extraction strategy for your data warehouse. Here is a list of data extraction issues:
• Source identification: identify source applications and source structures.
• Method of extraction: for each data source, define whether the extraction process is manual or tool-based.
• Extraction frequency: for each data source, establish how frequently the data extraction must be done: daily, weekly, quarterly, and so on.
• Time window: for each data source, denote the time window for the extraction process.

• Job sequencing: determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully.
• Exception handling: determine how to handle input records that cannot be extracted.

2.6.1 Source Identification Source identification, of course, encompasses the identification of all the proper data sources. It does not stop with just the identification of the data sources. It goes beyond that to examine and verify that the identified sources will provide the necessary value to the data warehouse.

Assume that a part of the database, maybe one of the data marts, is designed to provide strategic information on the fulfilment of orders. For this purpose, it is necessary to store historical information about the fulfilled and pending orders. If the orders are shipped through multiple delivery channels, one needs to capture data about these channels. If the users are interested in analysing the orders by the status of the orders as the orders go through the fulfilment process, then one needs to extract data on the order statuses. In the fact table for order fulfilment, one needs attributes about the total order amount, discounts, commissions, expected delivery time, actual delivery time, and dates at different stages of the process. One needs dimension tables for product, order disposition, delivery channel, and customer. First, it is necessary to determine if one has source systems to provide you with the data needed for this data mart. Then, from the source systems, one needs to establish the correct data source for each data element in the data mart. Further, go through a verification process to ensure that the identified sources are really the right ones.

The following figure describes a stepwise approach to source identification for order fulfilment. Source identification is not as simple a process as it may sound. It is a critical first process in the data extraction function. You need to go through the source identification process for every piece of information you have to store in the data warehouse.

Source identification process steps (from source systems such as order processing, customer, delivery channel, product, delivery contracts, shipment tracking and inventory management, to targets such as product data, customer data, disposition data, time data and order metrics):
• List each data item of metrics or facts needed for analysis in fact tables.
• List each dimension attribute from all dimensions.
• For each target data item, find the source system and source data item.
• If there are multiple sources for one data element, choose the preferred source.
• Identify multiple source fields for a single target field and form consolidation rules.
• Identify a single source field for multiple target fields and establish splitting rules.
• Ascertain default values.
• Inspect source data for missing values.

Fig. 2.7 Source identification process


2.6.2 Data Extraction Techniques
Business transactions keep changing the data in the source systems. In most cases, the value of an attribute in a source system is the value of that attribute at the current time. If you look at every data structure in the source operational systems, the day-to-day business transactions constantly change the values of the attributes in these structures. When a customer moves to another state, the data about that customer changes in the customer table in the source system. When two additional package types are added to the way a product may be sold, the product data changes in the source system. When a correction is applied to the quantity ordered, the data about that order gets changed in the source system.

Data in the source systems are said to be time-dependent or temporal. This is because source data changes with time. The value of a single variable varies over time.

Next, take the example of the change of address of a customer for a move from New York State to California. In the operational system, what is important is that the current address of the customer has CA as the state code. The actual change transaction itself, stating that the previous state code was NY and the revised state code is CA, need not be preserved. But think about how this change affects the information in the data warehouse. If the state code is used for analysing some measurements such as sales, the sales to the customer prior to the change must be counted in New York State and those after the move must be counted in California. In other words, the history cannot be ignored in the data warehouse. This arise the question: how to capture the history from the source systems? The answer depends on how exactly data is stored in the source systems. So let us examine and understand how data is stored in the source operational systems.

2.6.3 Data in Operational Systems
These source systems generally store data in two ways. Operational data in the source system may be thought of as falling into two broad categories. The type of data extraction technique you have to use depends on the nature of each of these two categories.

Fig. 2.8 Data in operational systems: the upper example shows an attribute stored as a current value only (a customer's state of residence changing from OH to CA to NY to NJ, with only the latest value kept at any date), while the lower example shows an attribute stored as periodic status (the status of property consigned to an auction house, with each status value RE - receipted, ES - value estimated, AS - assigned to auction, SL - sold preserved along with its effective date)

Current value
Most of the attributes in the source systems fall into this category. Here, the stored value of an attribute represents the value of the attribute at this moment in time. The values are transient or transitory. As business transactions happen, the values change. There is no way to predict how long the present value will stay or when it will be changed next. Customer name and address, bank account balances, and outstanding amounts on individual orders are some examples of this category. What is the implication of this category for data extraction? The value of an attribute remains constant only until a business transaction changes it. There is no telling when that will happen. Data extraction for preserving the history of the changes in the data warehouse gets quite involved for this category of data.

Periodic status
This category is not as common as the previous category. In this category, the value of the attribute is preserved as the status every time a change occurs. At each of these points in time, the status value is stored with reference to the time when the new value became effective. This category also includes events stored with reference to the time when each event occurred. Look at the way data about an insurance policy is usually recorded in the operational systems of an insurance company. The operational databases store the status data of the policy at each point in time when something in the policy changes. Similarly, for an insurance claim, each event, such as claim initiation, verification, appraisal, and settlement, is recorded with reference to the points in time. For operational data in this category, the history of the changes is preserved in the source systems themselves. Therefore, data extraction for the purpose of keeping history in the data warehouse is relatively easier. Whether it is status data or data about an event, the source systems contain data at each point in time when any change occurred. Pay special attention to the examples in the figure above.

Having reviewed the categories indicating how data is stored in the operational systems, we are now in a position to discuss the common techniques for data extraction. When you deploy your data warehouse, the initial data as of a certain time must be moved to the data warehouse to get it started. This is the initial load. After the initial load, your data warehouse must be kept updated so that the history of the changes and statuses is reflected in the data warehouse. Broadly, there are two major types of data extraction from the source operational systems: "as is" (static) data and data of revisions.
• "As is" or static data is the capture of data at a given point in time. It is like taking a snapshot of the relevant source data at a certain point in time. For current or transient data, this capture would include all transient data identified for extraction. In addition, for data categorised as periodic, this data capture would include each status or event at each point in time as available in the source operational systems. Primarily, you will use static data capture for the initial load of the data warehouse. Sometimes, you may want a full refresh of a dimension table. For example, assume that the product master of your source application is completely revamped. In this case, you may find it easier to do a full refresh of the product dimension table of the target data warehouse. Therefore, for this purpose, you will perform a static data capture of the product data.
• Data of revisions is also known as incremental data capture. Strictly, it is not incremental data but the revisions since the last time data was captured. If the source data is transient, the capture of the revisions is not easy. For periodic status data or periodic event data, the incremental data capture includes the values of attributes at specific times. Extract the statuses and events that have been recorded since the last date of extract. Incremental data capture may be immediate or deferred. Within the group of immediate data capture there are three distinct options. Two separate options are available for deferred data capture.

Immediate Data Extraction
In this option, the data extraction is real-time. It occurs as the transactions happen at the source databases and files.


Fig. 2.9 Immediate data extraction options: Option 1 captures changes through the DBMS transaction logs, Option 2 captures changes through database triggers (output files of trigger programs), and Option 3 captures changes within the source applications themselves, with the resulting extract files moving into the data staging area

Immediate Data Extraction is divided into three options:

Capture through Transaction Logs This option uses the transaction logs of the DBMSs maintained for recovery from possible failures. As each transaction adds, updates, or deletes a row from a database table, the DBMS immediately writes entries on the log file. This data extraction technique reads the transaction log and selects all the committed transactions. There is no extra overhead in the operational systems because logging is already part of the transaction processing. You have to ensure that all transactions are extracted before the log file gets refreshed. As log files on disk storage get filled up, the contents are backed up on other media and the disk log files are reused. Ensure that all log transactions are extracted for data warehouse updates.

If all source systems are database applications, there is no problem with this technique. But if some of your source system data is on indexed and other flat files, this option will not work for these cases. There are no log files for these non-database applications. You will have to apply some other data extraction technique for these cases. While we are on the topic of data capture through transaction logs, let us take a side excursion and look at the use of replication. Data replication is simply a method for creating copies of data in a distributed environment. Following figure illustrates how replication technology can be used to capture changes to source data.

Fig. 2.10 Data extraction using replication technology: a log transaction manager reads the DBMS transaction logs of the source operational systems, and the replication server stores the replicated log transactions in the data staging area

The appropriate transaction logs contain all the changes to the various source database tables. Here are the broad steps for using replication to capture changes to source data (a simplified sketch follows the list):
• Identify the source system DB table
• Identify and define target files in the staging area
• Create mapping between the source table and target files
• Define the replication mode
• Schedule the replication process
• Capture the changes from the transaction logs
• Transfer captured data from logs to target files
• Verify transfer of data changes
• Confirm success or failure of replication
• In metadata, document the outcome of replication
• Maintain definitions of sources, targets, and mappings
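The sketch below walks through a few of these steps in a highly simplified form. The structure of the log entries, the staging file names and the mapping are entirely hypothetical; a real replication server would read the DBMS's own log format rather than a Python list.

import json
import os

# Hypothetical committed entries read from a transaction log.
log_entries = [
    {"table": "customer", "op": "UPDATE", "committed": True,
     "row": {"customer_id": 42, "state": "CA"}},
    {"table": "customer", "op": "INSERT", "committed": False,   # not yet committed
     "row": {"customer_id": 77, "state": "NY"}},
]

# Mapping between source tables and target staging files (the "create mapping" step).
table_to_staging_file = {"customer": "staging/customer_changes.jsonl"}

def replicate(entries):
    """Capture committed changes from the log and write them to staging files."""
    os.makedirs("staging", exist_ok=True)
    outcome = {"applied": 0, "skipped": 0}
    for entry in entries:
        if not entry["committed"]:
            outcome["skipped"] += 1          # only committed transactions are selected
            continue
        target = table_to_staging_file[entry["table"]]
        with open(target, "a") as f:         # transfer captured data to the target file
            f.write(json.dumps(entry["row"]) + "\n")
        outcome["applied"] += 1
    return outcome                            # outcome to be documented, e.g. in metadata

print(replicate(log_entries))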

Capture through Database Triggers This option is applicable to source systems that are database applications. Triggers are special stored procedures (programs) that are stored on the database and fired when certain predefined events occur. You can create trigger programs for all events for which you need data to be captured. The output of the trigger programs is written to a separate file that will be used to extract data for the data warehouse. For example, if you need to capture all changes to the records in the customer table, write a trigger program to capture all updates and deletes in that table. Data capture through database triggers occurs right at the source and is therefore quite reliable. You can capture both before and after images. However, building and maintaining trigger programs puts an additional burden on the development effort. Also, execution of trigger procedures during transaction processing of the source systems puts additional overhead on the source systems. Further, this option is applicable only for source data in databases.
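As an illustration, SQLite (through Python's built-in sqlite3 module) lets you create such a trigger that copies every update on a customer table into a separate change table; the table and column names below are assumptions made for the example, not a prescribed layout.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, state TEXT);

-- Change table that the extract program will later read.
CREATE TABLE customer_changes (customer_id INTEGER, old_state TEXT,
                               new_state TEXT, changed_at TEXT);

-- Trigger fired on every update: captures both the before and after images.
CREATE TRIGGER trg_customer_update AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_changes
    VALUES (OLD.customer_id, OLD.state, NEW.state, datetime('now'));
END;
""")

conn.execute("INSERT INTO customer VALUES (1, 'A. Smith', 'NY')")
conn.execute("UPDATE customer SET state = 'CA' WHERE customer_id = 1")
print(conn.execute("SELECT * FROM customer_changes").fetchall())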


Capture in Source Application This technique is also referred to as application-assisted data capture. In other words, the source application is made to assist in the data capture for the data warehouse. You have to modify the relevant application programs that write to the source files and databases. You revise the programs to write all adds, updates, and deletes to the source files and database tables. Then other extract programs can use the separate file containing the changes to the source data. Unlike the previous two cases, this technique may be used for all types of source data irrespective of whether it is in databases, indexed files, or other flat files. But you have to revise the programs in the source operational systems and keep them maintained. This could be a formidable task if the number of source system programs is large. Also, this technique may degrade the performance of the source applications because of the additional processing needed to capture the changes on separate files.

Deferred Data Extraction
In the cases discussed above, data capture takes place while the transactions occur in the source operational systems. The data capture is immediate or real-time. In contrast, the techniques under deferred data extraction do not capture the changes in real time. The capture happens later. Refer to the figure below for deferred data extraction options.

Fig. 2.11 Deferred data extraction: Option 1 captures changes from the source databases based on date and time stamps, while Option 2 captures changes by comparing today's extract file with yesterday's extract file, with the results placed in the data staging area

Two options of deferred data extraction are detailed below.

Capture Based on Date and Time Stamp
Every time a source record is created or updated, it may be marked with a stamp showing the date and time. The time stamp provides the basis for selecting records for data extraction. Here, the data capture occurs at a later time, not while each source record is created or updated. If you run your data extraction program at midnight every day, each day you will extract only those records with a date and time stamp later than midnight of the previous day. This technique works well if the number of revised records is small. This technique presupposes that all the relevant source records contain date and time stamps. Provided this is true, data capture based on date and time stamp can work for any type of source file. This technique captures the latest state of the source data.

Any intermediary states between two data extraction runs are lost. Deletion of source records presents a special problem. If a source record gets deleted between two extract runs, the information about the delete is not detected. You can get around this by marking the source record for delete first, doing the extraction run, and then physically deleting the record. This means you have to add more logic to the source applications.
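A sketch of a nightly timestamp-based extract is shown below. The record layout, the cutoff handling and the sample values are assumptions made purely for illustration.

from datetime import datetime

# Hypothetical source records, each stamped when created or last updated.
source_records = [
    {"order_id": 1, "amount": 120.0, "updated_at": datetime(2001, 3, 1, 14, 30)},
    {"order_id": 2, "amount": 75.5,  "updated_at": datetime(2001, 3, 2, 9, 15)},
    {"order_id": 3, "amount": 410.0, "updated_at": datetime(2001, 3, 2, 23, 40)},
]

def incremental_extract(records, last_extract_time):
    """Select only records created or revised since the previous extract run."""
    return [r for r in records if r["updated_at"] > last_extract_time]

# Midnight run: take everything stamped after the previous run's cutoff.
previous_cutoff = datetime(2001, 3, 2, 0, 0)
changed = incremental_extract(source_records, previous_cutoff)
print([r["order_id"] for r in changed])   # -> [2, 3]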

Capture by Comparing Files
If none of the above techniques is feasible for specific source files in your environment, then consider this technique as the last resort. This technique is also called the snapshot differential technique because it compares two snapshots of the source data. Let us see how this technique works. Suppose you want to apply this technique to capture the changes to your product data.

While performing today's data extraction for changes to product data, you do a full file comparison between today's copy of the product data and yesterday's copy. You also compare the record keys to find the inserts and deletes. Then you capture any changes between the two copies.

This technique necessitates the keeping of prior copies of all the relevant source data. Though simple and straightforward, comparison of full rows in a large file can be very inefficient. However, this may be the only feasible option for some legacy data sources that do not have transaction logs or time stamps on source records.
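The sketch below compares yesterday's and today's snapshots, keyed on the record key, to derive inserts, deletes and updates. The snapshots here are small in-memory dictionaries purely for illustration; a real implementation would read the prior copies kept on disk.

# Snapshots keyed by the record key (e.g., product SKU). Values are full rows.
yesterday = {"P1": ("bigpart-1", 10.0), "P2": ("smallpart-2", 4.0), "P3": ("widget-3", 7.5)}
today     = {"P1": ("bigpart-1", 10.0), "P2": ("smallpart-2", 4.5), "P4": ("gadget-4", 2.0)}

def snapshot_differential(old, new):
    """Full-file comparison: detect inserts, deletes and changed rows."""
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserts, deletes, updates

print(snapshot_differential(yesterday, today))
# -> insert P4, delete P3, update P2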

2.7 Data Transformation
The data extraction function is designed using several techniques. The extracted data is raw data, and it cannot be applied to the data warehouse as it is. First, all the extracted data must be made usable in the data warehouse. Having information that is usable for strategic decision making is the underlying principle of the data warehouse. You know that the data in the operational systems is not usable for this purpose. Next, because operational data is extracted from many old legacy systems, the quality of the data in those systems is less likely to be good enough for the data warehouse.

Before moving the extracted data from the source systems into the data warehouse, you inevitably have to perform various kinds of data transformations. You have to transform the data according to standards because they come from many dissimilar source systems. You have to ensure that after all the data is put together, the combined data does not violate any business rules.

Consider the data structures and data elements that you need in your data warehouse. Now think about all the relevant data to be extracted from the source systems. From the variety of source data formats, data values, and the condition of the data quality, you know that you have to perform several types of transformations to make the source data suitable for your data warehouse. Transformation of source data encompasses a wide variety of manipulations to change all the extracted source data into usable information to be stored in the data warehouse. Many companies underestimate the extent and complexity of the data transformation functions. They start out with a simple departmental data mart as the pilot project. Almost all of the data for this pilot comes from a single source application. The data transformation just entails field conversions and some reformatting of the data structures. Do not make the mistake of taking the data transformation functions too lightly. Be prepared to consider all the different issues and allocate sufficient time and effort to the task of designing the transformations.

Irrespective of the variety and complexity of the source operational systems, and regardless of the extent of your data warehouse, you will find that most of your data transformation functions break down into a few basic tasks. Let us go over these basic tasks so that you can view data transformation from a fundamental perspective. Here is the set of basic tasks:


Selection: This takes place at the beginning of the whole process of data transformation. You select either whole records or parts of several records from the source systems. The task of selection usually forms part of the extraction function itself. However, in some cases, the composition of the source structure may not be amenable to selection of the necessary parts during data extraction. In these cases, it is prudent to extract the whole record and then do the selection as part of the transformation function.

Splitting/joining: This task includes the types of data manipulation you need to perform on the selected parts of source records. Sometimes (uncommonly), you will be splitting the selected parts even further during data transformation. Joining of parts selected from many source systems is more widespread in the data warehouse environment.

Conversion: This is an all-inclusive task. It includes a large variety of rudimentary conversions of single fields for two primary reasons: one, to standardise among the data extractions from disparate source systems, and the other, to make the fields usable and understandable to the users.

Summarisation: Sometimes you may find that it is not feasible to keep data at the lowest level of detail in your data warehouse. It may be that none of your users ever need data at the lowest granularity for analysis or querying. For example, for a grocery chain, sales data at the lowest level of detail for every transaction at the checkout may not be needed. Storing sales by product by store by day in the data warehouse may be quite adequate. So, in this case, the data transformation function includes summarisation of daily sales by product and by store.

Enrichment: This task is the rearrangement and simplification of individual fields to make them more useful for the data warehouse environment. You may use one or more fields from the same input record to create a better view of the data for the data warehouse. This principle is extended when one or more fields originate from multiple records, resulting in a single field for the data warehouse.

Table 2.1 Basic tasks in data transformation

2.7.1 Major Transformation Types
When you consider a particular set of extracted data structures, you will find that the transformation functions you need to perform on this set may be done by a combination of the basic tasks discussed. Now let us consider specific types of transformation functions. These are the most common transformation types:

Format revisions: These revisions include changes to the data types and lengths of individual fields. In your source systems, product package types may be indicated by codes and names in which the fields are numeric and text data types. Again, the lengths of the package types may vary among the different source systems. It is wise to standardise and change the data type to text to provide values meaningful to the users.

Decoding of fields: This is also a common type of data transformation. When you deal with multiple source systems, you are bound to have the same data items described by a plethora of field values. The classic example is the coding for gender, with one source system using 1 and 2 for male and female and another system using M and F. Also, many legacy systems are notorious for using cryptic codes to represent business values. What do the codes AC, IN, RE, and SU mean in a customer file? You need to decode all such cryptic codes and change these into values that make sense to the users. Change the codes to Active, Inactive, Regular, and Suspended.

Calculated and derived values: The extracted data from the sales system contains sales amounts, sales units, and operating cost estimates by product. You will have to calculate the total cost and the profit margin before data can be stored in the data warehouse. Average daily balances and operating ratios are examples of derived fields.

Splitting of single fields: Earlier legacy systems stored names and addresses of customers and employees in large text fields. The first name, middle initials, and last name were stored as a large text in a single field. Similarly, some earlier systems stored city, state, and Zip Code data together in a single field. You need to store individual components of names and addresses in separate fields in your data warehouse for two reasons. First, you may improve the operating performance by indexing on individual components. Second, your users may need to perform analysis by using individual components such as city, state, and Zip Code.

Merging of information: This is not quite the opposite of splitting of single fields. This type of data transformation does not literally mean the merging of several fields to create a single field of data. For example, information about a product may come from different data sources. The product code and description may come from one data source. The relevant package types may be found in another data source. The cost data may be from yet another source. In this case, merging of information denotes the combination of the product code, description, package types, and cost into a single entity.

Character set conversion: This type of data transformation relates to the conversion of character sets to an agreed standard character set for textual data in the data warehouse. If you have mainframe legacy systems as source systems, the source data from these systems will be in EBCDIC characters. If PC-based architecture is the choice for your data warehouse, then you must convert the mainframe EBCDIC format to the ASCII format. When your source data is on other types of hardware and operating systems, you are faced with similar character set conversions.

Conversion of units of measurement: Many companies today have global branches. Measurements in many European countries are in metric units. If your company has overseas operations, you may have to convert the metrics so that the numbers may all be in one standard unit of measurement.

Date/time conversion: This type relates to the representation of date and time in standard formats. For example, the American and the British date formats may be standardised to an international format. The date of October 11, 2000 is written as 10/11/2000 in the U.S. format and as 11/10/2000 in the British format. This date may be standardised to be written as 11 OCT 2000.

Summarisation: This type of transformation is the creation of summaries to be loaded in the data warehouse instead of loading the most granular level of data. For example, for a credit card company to analyse sales patterns, it may not be necessary to store in the data warehouse every single transaction on each credit card. Instead, you may want to summarise the daily transactions for each credit card and store the summary data instead of storing the most granular data by individual transactions.

Key restructuring: While extracting data from your input sources, look at the primary keys of the extracted records. You will have to come up with keys for the fact and dimension tables based on the keys in the extracted records. When choosing keys for your data warehouse database tables, avoid keys with built-in meanings. Transform such keys into generic keys generated by the system itself. This is called key restructuring.

Deduplication: In many companies, the customer files have several records for the same customer. Mostly, the duplicates are the result of creating additional records by mistake. In your data warehouse, you want to keep a single record for one customer and link all the duplicates in the source systems to this single record. This process is called deduplication of the customer file. Employee files and, sometimes, product master files have this kind of duplication problem.

Table 2.2 Data transformation types
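The short sketch below applies three of these transformation types (decoding of fields, date/time conversion and splitting of a single field) to one extracted record. The code tables, field names and sample record are illustrative assumptions, not a standard.

from datetime import datetime

STATUS_DECODE = {"AC": "Active", "IN": "Inactive", "RE": "Regular", "SU": "Suspended"}
GENDER_DECODE = {"1": "Male", "2": "Female", "M": "Male", "F": "Female"}

def transform(record):
    """Apply a few common transformation types to one raw extracted record."""
    out = {}
    # Decoding of fields: cryptic codes become values meaningful to the users.
    out["status"] = STATUS_DECODE.get(record["status_code"], "Unknown")
    out["gender"] = GENDER_DECODE.get(record["gender_code"], "Unknown")
    # Date/time conversion: U.S.-format source date standardised to DD MON YYYY.
    parsed = datetime.strptime(record["order_date"], "%m/%d/%Y")
    out["order_date"] = parsed.strftime("%d %b %Y").upper()
    # Splitting of a single field: one large name field into individual components.
    first, *rest = record["customer_name"].split()
    out["first_name"], out["last_name"] = first, rest[-1] if rest else ""
    return out

raw = {"status_code": "AC", "gender_code": "2",
       "order_date": "10/11/2000", "customer_name": "Jane Q Doe"}
print(transform(raw))   # order_date becomes '11 OCT 2000'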

2.7.2 Data Integration and Consolidation
The real challenge of ETL functions is the pulling together of all the source data from many disparate, dissimilar source systems. As of today, most of the data warehouses get data extracted from a combination of legacy mainframe systems, old minicomputer applications, and some newer client/server systems. Most of these source systems do not conform to the same set of business rules. Very often they follow different naming conventions and varied standards for data representation. The following figure shows a typical data source environment. Notice the challenging issues indicated in the figure.

Fig. 2.12 Typical data source environment: data comes from mainframe, mini and Unix platforms, with challenges such as multiple character sets (EBCDIC/ASCII), multiple data types, missing values, no default values, multiple naming standards, conflicting business rules, incompatible structures and inconsistent values

Integrating the data is the combining of all the relevant operational data into coherent data structures to be made ready for loading into the data warehouse. You may need to consider data integration and consolidation as a type of pre-process before other major transformation routines are applied. You have to standardise the names and data representations and resolve discrepancies in the ways in which the same data is represented in different source systems. Although time-consuming, many of the data integration tasks can be managed. However, let us go over a couple of more difficult challenges.

Entity Identification Problem
If you have three different legacy applications developed in your organisation at different times in the past, you are likely to have three different customer files supporting those systems. One system may be the old order entry system, the second the customer service support system, and the third the marketing system. Most of the customers will be common to all three files. The same customer on each of the files may have a unique identification number, but these unique identification numbers for the same customer may not be the same across the three systems. This is a problem of identification: you do not know which of the customer records relate to the same customer. But in the data warehouse you need to keep a single record for each customer. You must be able to gather the activities of the single customer from the various source systems and then match them up with the single record to be loaded into the data warehouse. This is a common but very difficult problem in many enterprises where applications have evolved over time from the distant past. This type of problem is prevalent where multiple sources exist for the same entities. Vendors, suppliers, employees, and sometimes products are the kinds of entities that are prone to this type of problem.

In the above example of the three customer files, you have to design complex algorithms to match records from all the three files and form groups of matching records. No matching algorithm can completely determine the groups. If the matching criteria are too tight, then some records will escape the groups. On the other hand, if the matching criteria are too loose, a particular group may include records of more than one customer. You need to get your users involved in reviewing the exceptions to the automated procedures. You have to weigh the issues relating to your source systems and decide how to handle the entity identification problem. Every time a data extract function is performed for your data warehouse, which may be every day, do you pause to resolve the entity identification problem before loading the data warehouse? How will this affect the availability of the data warehouse to your users? Some companies, depending on their individual situations, take the option of solving the entity identification problem in two phases. In the first phase, all records, irrespective of whether they are duplicates or not, are assigned unique identifiers. The second phase consists of reconciling the duplicates periodically through automatic algorithms and manual verification.
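A highly simplified matching sketch follows. Real deduplication uses far more sophisticated algorithms; the matching criterion here (normalised name plus postal code) and the sample files are assumptions made only to illustrate why tight criteria let some records escape their group.

from collections import defaultdict

def match_key(record):
    """Naive matching criterion: normalised name plus postal code."""
    name = "".join(record["name"].lower().split())
    return (name, record["postcode"])

def group_customers(*files):
    """Group records from several customer files that appear to be the same customer."""
    groups = defaultdict(list)
    for system, records in files:
        for rec in records:
            groups[match_key(rec)].append((system, rec["id"]))
    return groups

order_entry = [{"id": "C100", "name": "Jane Doe", "postcode": "04401"}]
service     = [{"id": "9087", "name": "JANE DOE", "postcode": "04401"}]
marketing   = [{"id": "M-55", "name": "J. Doe",   "postcode": "04401"}]  # escapes the group

groups = group_customers(("order_entry", order_entry),
                         ("service", service),
                         ("marketing", marketing))
for key, members in groups.items():
    print(key, members)   # the abbreviated name forms a separate, unmatched group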

Multiple Source Problem
This is another kind of problem affecting data integration, although less common and complex than the entity identification problem. This problem results from a single data element having more than one source. For example, suppose unit cost of products is available from two systems. In the standard costing application, cost values are calculated and updated at specific intervals. Your order processing system also carries the unit costs for all products. There could be slight variations in the cost figures from these two systems.

You need to know from which system you should get the cost for storing in the data warehouse. A straightforward solution is to assign a higher priority to one of the two sources and pick up the product unit cost from that source. Sometimes, a straightforward solution such as this may not sit well with the needs of the data warehouse users. You may have to select from either of the sources based on the last update date. Or, in some other instances, your determination of the appropriate source may depend on other related fields.
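A sketch of the two resolution rules just mentioned (fixed source priority, or most recent update wins) is given below; the source names, priorities and sample values are hypothetical.

from datetime import date

# Unit cost for the same product reported by two source systems.
candidates = [
    {"source": "standard_costing", "unit_cost": 12.40, "last_update": date(2001, 2, 28)},
    {"source": "order_processing", "unit_cost": 12.55, "last_update": date(2001, 3, 2)},
]

SOURCE_PRIORITY = {"standard_costing": 1, "order_processing": 2}  # lower = preferred

def resolve_by_priority(values):
    """Rule 1: always take the value from the higher-priority source."""
    return min(values, key=lambda v: SOURCE_PRIORITY[v["source"]])

def resolve_by_recency(values):
    """Rule 2: take the value with the latest update date."""
    return max(values, key=lambda v: v["last_update"])

print(resolve_by_priority(candidates)["unit_cost"])   # 12.40
print(resolve_by_recency(candidates)["unit_cost"])    # 12.55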

2.7.3 Implementing Transformation
The complexity and the extent of data transformation strongly suggest that manual methods alone will not be enough. You must go beyond the usual methods of writing conversion programs that you used when you deployed operational systems. The types of data transformation required here are far more difficult and challenging.

The methods you may want to adopt depend on some significant factors. If you are considering automating most of the data transformation functions, first consider if you have the time to select the tools, configure and install them, train the project team on the tools, and integrate the tools into the data warehouse environment. Data transformation tools can be expensive. If the scope of your data warehouse is modest, then the project budget may not have room for transformation tools.

In many cases, a suitable combination of both methods will prove to be effective. Find the proper balance based on the available time frame and the money in the budget.


Using Transformation Tools In recent years, transformation tools have greatly increased in functionality and flexibility. Although the desired goal for using transformation tools is to eliminate manual methods altogether, in practice this is not completely possible. Even if you get the most sophisticated and comprehensive set of transformation tools, be prepared to use in-house programs here and there.

Use of automated tools certainly improves efficiency and accuracy. As a data transformation specialist, you just have to specify the parameters, the data definitions, and the rules to the transformation tool. If your input into the tool is accurate, then the rest of the work is performed efficiently by the tool. You gain a major advantage from using a transformation tool because of the recording of metadata by the tool. When you specify the transformation parameters and rules, these are stored as metadata by the tool. This metadata then becomes part of the overall metadata component of the data warehouse. It may be shared by other components. When changes occur to transformation functions because of changes in business rules or data definitions, you just have to enter the changes into the tool. The metadata for the transformations get automatically adjusted by the tool.

Using Manual Techniques This was the predominant method until recently when transformation tools began to appear in the market. Manual techniques may still be adequate for smaller data warehouses. Here, manually coded programs and scripts perform every data transformation. Mostly, these programs are executed in the data staging area. The analysts and programmers who already possess the knowledge and the expertise are able to produce the programs and scripts. This method involves elaborate coding and testing. Although the initial cost may be reasonable, ongoing maintenance may escalate the cost. Unlike automated tools, the manual method is more likely to be prone to errors. It may also turn out that several individual programs are required in your environment.

A major disadvantage relates to metadata. Automated tools record their own metadata, but in-house programs have to be designed differently if you need to store and use metadata. Even if the in-house programs record the data transformation metadata initially, each time changes occur to the transformation rules, the metadata has to be revised by hand. This puts an additional burden on the maintenance of the manually coded transformation programs.

2.8 Data Loading
It is generally agreed that transformation functions end as soon as load images are created. The next major set of functions consists of the ones that take the prepared data, apply it to the data warehouse, and store it in the database there. You create load images to correspond to the target files to be loaded in the data warehouse database.

The whole process of moving data into the data warehouse repository is referred to in several ways. You must have heard the phrases applying the data, loading the data, and refreshing the data. For the sake of clarity we will use the phrases as indicated below:
• Initial load: populating all the data warehouse tables for the very first time
• Incremental load: applying ongoing changes as necessary in a periodic manner
• Full refresh: completely erasing the contents of one or more tables and reloading with fresh data (the initial load is a refresh of all the tables)

As loading the data warehouse may take an inordinate amount of time, loads are generally a cause for great concern. During the loads, the data warehouse has to be offline. You need to find a window of time when the loads may be scheduled without affecting your data warehouse users. Therefore, consider dividing up the whole load process into smaller chunks and populating a few files at a time. This gives you two benefits: you may be able to run the smaller loads in parallel, and you might also be able to keep some parts of the data warehouse up and running while loading the other parts. It is hard to estimate the running times of the loads, especially the initial load or a complete refresh. Do test loads to verify the correctness and to estimate the running times.
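The sketch below distinguishes the three load modes using Python's built-in sqlite3 module. The single-table layout is a toy example chosen only to show the difference between the modes, not a recommended warehouse design.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT)")

def initial_load(rows):
    """Populate the table for the very first time."""
    conn.executemany("INSERT INTO product_dim VALUES (?, ?)", rows)

def incremental_load(rows):
    """Apply ongoing changes periodically (insert new rows, replace revised ones)."""
    conn.executemany("INSERT OR REPLACE INTO product_dim VALUES (?, ?)", rows)

def full_refresh(rows):
    """Erase the table contents completely and reload with fresh data."""
    conn.execute("DELETE FROM product_dim")
    conn.executemany("INSERT INTO product_dim VALUES (?, ?)", rows)

initial_load([(1, "bigpart-1"), (2, "smallpart-2")])
incremental_load([(2, "smallpart-2A"), (3, "widget-3")])
print(conn.execute("SELECT * FROM product_dim ORDER BY product_key").fetchall())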

2.9 Data Quality
Accuracy is associated with a data element. Consider an entity such as customer. The customer entity has attributes such as customer name, customer address, customer state, customer lifestyle, and so on. Each occurrence of the customer entity refers to a single customer. Data accuracy, as it relates to the attributes of the customer entity, means that the values of the attributes of a single occurrence accurately describe the particular customer. The value of the customer name for a single occurrence of the customer entity is actually the name of that customer. Data quality implies data accuracy, but it is much more than that. Most cleansing operations concentrate on data accuracy only. You need to go beyond data accuracy. If the data is fit for the purpose for which it is intended, we can then say such data has quality. Therefore, data quality is to be related to the usage of the data item as defined by the users. Does the data item in an entity reflect exactly what the user is expecting to observe? Does the data item possess fitness of purpose as defined by the users? If it does, the data item conforms to the standards of data quality.

If the database records conform to the field validation edits, then we generally say that the database records are of good data quality. But such single-field edits alone do not constitute data quality. Data quality in a data warehouse is not just the quality of individual data items but the quality of the full, integrated system as a whole. It is more than the data edits on individual fields. For example, while entering data about the customers in an order entry application, you may also collect the demographics of each customer. The customer demographics are not germane to the order entry application and, therefore, they are not given too much attention. But you run into problems when you try to access the customer demographics in the data warehouse: the customer data as an integrated whole lacks data quality.

The following list is a survey of the characteristics or indicators of high-quality data.

• Accuracy: The value stored in the system for a data element is the right value for that occurrence of the data element. If you have a customer name and an address stored in a record, then the address is the correct address for the customer with that name. If you find the quantity ordered as 1000 units in the record for order number 12345678, then that quantity is the accurate quantity for that order.
• Domain Integrity: The data value of an attribute falls in the range of allowable, defined values. The common example is the allowable values being “male” and “female” for the gender data element.
• Data Type: Value for a data attribute is actually stored as the data type defined for that attribute. When the data type of the store name field is defined as “text,” all instances of that field contain the store name shown in textual format and not in numeric codes.
• Consistency: The form and content of a data field is the same across multiple source systems. If the product code for product ABC in one system is 1234, then the code for this product must be 1234 in every source system.
• Redundancy: The same data must not be stored in more than one place in a system. In case, for reasons of efficiency, a data element is intentionally stored in more than one place in a system, then the redundancy must be clearly identified.
• Completeness: There are no missing values for a given attribute in the system.
• Duplication: Duplication of records in a system is completely resolved. If the product file is known to have duplicate records, then all the duplicate records for each product are identified and a cross-reference created.


• Conformance to Business Rules: The values of each data item adhere to prescribed business rules. In an auction system, the hammer or sale price cannot be less than the reserve price. In a bank loan system, the loan balance must always be positive or zero.
• Structural Definiteness: Wherever a data item can naturally be structured into individual components, the item must contain this well-defined structure.
• Data Anomaly: A field must be used only for the purpose for which it is defined. If the field Address-3 is defined for any possible third line of address for long addresses, then this field must be used only for recording the third line of address. It must not be used for entering a phone or fax number for the customer.
• Clarity: A data element may possess all the other characteristics of quality data but if the users do not understand its meaning clearly, then the data element is of no value to the users. Proper naming conventions help to make the data elements well understood by the users.
• Timely: The users determine the timeliness of the data. If the users expect customer dimension data not to be older than one day, the changes to customer data in the source systems must be applied to the data warehouse daily.
• Usefulness: Every data element in the data warehouse must satisfy some requirements of the collection of users. A data element may be accurate and of high quality, but if it is of no value to the users, then it is totally unnecessary for that data element to be in the data warehouse.
• Adherence to Rules: The data stored in the relational databases of the source systems must adhere to entity integrity and referential integrity rules. Any table that permits null as the primary key does not have entity integrity. Referential integrity forces the establishment of the parent–child relationships correctly. In a customer-to-order relationship, referential integrity ensures the existence of a customer for every order in the database.

Table 2.3 Characteristics or indicators of high-quality data
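A few of these indicators lend themselves to simple automated checks. The following minimal pandas sketch, using a hypothetical customer table whose column names and allowable values are assumptions rather than anything from the text, flags violations of domain integrity, completeness and duplication.

```python
import pandas as pd

# Hypothetical customer data used purely for illustration.
customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "name": ["Asha", "Ravi", "Ravi", None],
    "gender": ["female", "male", "male", "unknown"],
})

# Domain integrity: gender must fall within the allowable set of values.
bad_domain = customers[~customers["gender"].isin(["male", "female"])]

# Completeness: no missing values for the name attribute.
missing_names = customers[customers["name"].isna()]

# Duplication: duplicate customer records must be identified and resolved.
duplicates = customers[customers.duplicated(subset="customer_id", keep=False)]

print(bad_domain, missing_names, duplicates, sep="\n\n")
```

Checks like these cover individual indicators only; as the text stresses, data quality in the warehouse is a property of the integrated whole, not just of single fields.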

2.10 Information Access and Delivery
You have extracted and transformed the source data. You have the best data design for the data warehouse repository. You have applied the most effective data cleansing methods and got rid of most of the pollution from the source data. Using the most optimal methods, you have loaded the transformed and cleansed data into your data warehouse database. After performing all of these tasks most effectively, if your team has not provided the best possible mechanism for information delivery to your users, you have really accomplished nothing from the users’ perspective. As you know, the data warehouse exists for one reason and one reason alone: its function is to provide strategic information to your users. For the users, the information delivery mechanism is the data warehouse. The user interface for information is what determines the ultimate success of your data warehouse. If the interface is intuitive, easy to use, and enticing, the users will keep coming back to the data warehouse. If the interface is difficult to use, cumbersome, and convoluted, your project team may as well leave the scene.

2.11 Matching Information to Classes of Users OLAP in Data Warehouse
Let us review how information delivery from a data warehouse differs from information delivery from an operational system.


2.11.1 Information from the Data Warehouse
You must have worked on different types of operational systems that provide information to users. The users in enterprises make use of the information from the operational systems to perform their day-to-day work and run the business. If we have been involved in information delivery from operational systems and we understand what information delivery to the users entails, then what is the need for this special study on information delivery from the data warehouse?

If the kinds of strategic information made available in a data warehouse were readily available from the source systems, then we would not really need the warehouse. Data warehousing enables the users to make better strategic decisions by obtaining data from the source systems and keeping it in a format suitable for querying and analysis.

2.11.2 Information Potential
It is necessary to gain an appreciation of the enormous information potential of the data warehouse. Because of this great potential, we have to pay adequate attention to information delivery from the data warehouse. We cannot treat information delivery in a special way unless we fully realise the significance of how the data warehouse plays a key role in the overall management of an enterprise.

Overall Enterprise Management
In every enterprise, three sets of processes govern the overall management. First, the enterprise is engaged in planning. Secondly, execution of the plans takes place, followed by assessment of the results of the execution. The following figure indicates these plan–execute–assess processes.

[Figure: the plan–execute–assess closed loop. Planning (the data warehouse helps in planning; marketing campaigns are enhanced based on results) → Execution (execute marketing campaign) → Assessment (the data warehouse helps assess results; assess result of campaign) → back to Planning.]
Fig. 2.13 Enterprise plan-execute-assess closed loop

Assessment of the results determines the effectiveness of the campaigns. Based on the assessment of the results, more plans may be made to vary the composition of the campaigns or launch additional ones. The cycle of planning, executing, and assessing continues.


It is very interesting to note that the data warehouse, with its specialised information potential, fits nicely in this plan–execute–assess loop. The data warehouse reports on the past and helps to plan the future. Initially, the data warehouse assists in the planning. Once the plans are executed, the data warehouse is used to assess the effectiveness of the execution.

Information potential for business areas
We considered one isolated example of how the information potential of your data warehouse can assist in the planning for a market expansion and in the assessment of the results of the execution of marketing campaigns for that purpose. Following are a few general areas of the enterprise where the data warehouse can assist in the planning and assessment phases of the management loop.

• Profitability Growth: To increase profits, management has to understand how the profits are tied to product lines, markets, and services. Management must gain insights into which product lines and markets produce greater profitability. The information from the data warehouse is ideally suited to plan for profitability growth and to assess the results when the plans are executed.
• Strategic Marketing: Strategic marketing drives business growth. When management studies the opportunities for up-selling and cross-selling to existing customers and for expanding the customer base, they can plan for business growth. The data warehouse has great information potential for strategic marketing.
• Customer Relationship Management: A customer’s interactions with an enterprise are captured in various operational systems. The order processing system contains the orders placed by the customer; the product shipment system, the shipments; the sales system, the details of the products sold to the customer; the accounts receivable system, the credit details and the outstanding balances. The data warehouse has all the data about the customer extracted from the various disparate source systems, transformed, and integrated. Thus, your management can “know” their customers individually from the information available in the data warehouse. This knowledge results in better customer relationship management.
• Corporate Purchasing: The management can get the overall picture of corporate-wide purchasing patterns from the data warehouse. This is where all data about products and vendors are collected after integration from the source systems. Your data warehouse empowers corporate management to plan for streamlining purchasing processes.
• Realising the Information Potential: The various operational systems collect massive quantities of data on numerous types of business transactions. But these operational systems are not directly helpful for planning and assessment of results. The users need to assess the results by viewing the data in the proper business context.

Table 2.4 General areas where data warehouse can assist in the planning and assessment phases

Summary
• Data design consists of putting together the data structures. A group of data elements form a data structure.
• Logical data design includes determination of the various data elements that are needed and combination of the data elements into structures of data. Logical data design also includes establishing the relationships among the data structures.
• Many case tools are available for data modelling. These tools can be used for creating the logical schema and the physical schema for specific target database management systems (DBMS).
• Another very useful function found in the case tools is the ability to forward-engineer the model and generate the schema for the target database system you need to work with.
• Creating the STAR schema is the fundamental data design technique for the data warehouse. It is necessary to gain a good grasp of this technique.
• Dimensional modelling gets its name from the business dimensions we need to incorporate into the logical data model.
• The multidimensional information package diagram we have discussed is the foundation for the dimensional model.
• Dimensional modelling is a technique for conceptualising and visualising data models as a set of measures that are described by common aspects of the business.
• Source identification encompasses the identification of all the proper data sources. It does not stop with just the identification of the data sources.
• Business transactions keep changing the data in the source systems.
• Operational data in the source system may be thought of as falling into two broad categories.
• Irrespective of the variety and complexity of the source operational systems, and regardless of the extent of your data warehouse, you will find that most of your data transformation functions break down into a few basic tasks.
• The whole process of moving data into the data warehouse repository is referred to in several ways. You must have heard the phrases applying the data, loading the data, and refreshing the data.

References
• Han, J. and Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd ed., Diane Cerra.
• Kimball, R., 2006. The Data Warehouse Lifecycle Toolkit, Wiley-India.
• Mento, B. and Rapple, B., 2003. Data Mining and Warehousing [Online] Available at: . [Accessed 9 September 2011].
• Orli, R. and Santos, F., 1996. Data Extraction, Transformation, and Migration Tools [Online] Available at: . [Accessed 9 September 2011].
• Learndatavault, 2009. Business Data Warehouse (BDW) [Video Online] Available at: . [Accessed 12 September 2011].
• SQLUSA, 2009. SQLUSA.com Data Warehouse and OLAP [Video Online] Available at: <http://www.youtube.com/watch?v=OJb93PTHsHo>. [Accessed 12 September 2011].

Recommended Reading
• Prabhu, C. S. R., 2004. Data Warehousing: Concepts, Techniques, Products and Applications, 2nd ed., PHI Learning Pvt. Ltd.
• Ponniah, P., 2001. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, Wiley-Interscience Publication.
• Ponniah, P., 2010. Data Warehousing Fundamentals for IT Professionals, 2nd ed., John Wiley and Sons.


Self Assessment
1. ______ consists of putting together the data structures.
a. Data mining
b. Data design
c. Data warehousing
d. Metadata

2. Match the columns.
A. Selecting the subjects from the information packages for the first set of logical structures to be designed.
B. Determining the level of detail for the data in the data structures.
C. Selecting the metrics or units of measurements to be included in the first set of structures.
D. Determining how far back in time you should go for historical data.
1. Choosing the process
2. Choosing the grain
3. Choosing the facts
4. Choosing the durations of the database
a. 1-A, 2-B, 3-C, 4-D
b. 1-D, 2-A, 3-C, 4-B
c. 1-B, 2-A, 3-C, 4-D
d. 1-C, 2-B, 3-A, 4-D

3. Which of the following is used to define the tables, the attributes and the relationships?
a. Metadata
b. Data warehousing
c. Data design
d. Case tools

4. Creating the ______ is the fundamental data design technique for the data warehouse.
a. STAR schema
b. Data transformation
c. Dimensional modelling
d. Data extraction

5. Each row in a dimension table is identified by a unique value of an attribute designated as the ______ of the dimension.
a. ordinary key
b. primary key
c. surrogate key
d. foreign key

6. How many general principles are to be applied when choosing primary keys for dimension tables?
a. One
b. Two
c. Three
d. Four

7. Which of the following keys are simply system-generated sequence numbers?
a. Ordinary key
b. Primary key
c. Surrogate key
d. Foreign key

8. Which of the following is a logical design technique to structure the business dimensions and the metrics that are analysed along these dimensions?
a. Mapping
b. Data extraction
c. Dimensional modelling
d. E-R modelling

9. Which technique is adopted to create the data models for these systems?
a. E-R modelling
b. Dimensional modelling
c. Source identification
d. Data extraction

10. ______in the source systems are said to be time-dependent or temporal. a. Data b. Data mining c. Data warehousing d. Mapping


Chapter III Data Mining

Aim

The aim of this chapter is to:

• introduce the concept of data mining

• analyse different data mining techniques

• explore the crucial concepts of data mining

Objectives

The objectives of this chapter are to:

• explicate cross-industry standard process

• highlight the dimensional modelling

• describe the process of graph mining

• elucidate social network analysis

Learning outcome

At the end of this chapter, you will be able to:

• discuss multirelational data mining

• comprehend data mining algorithms

• understand classification, clustering and association rules

3.1 Introduction
Data mining refers to the process of finding interesting patterns in data that are not explicitly part of the data. The interesting patterns can be used to make predictions. The process of data mining is composed of several steps including selecting data to analyse, preparing the data, applying the data mining algorithms, and then interpreting and evaluating the results. Sometimes the term data mining refers only to the step in which the data mining algorithms are applied, which has created a fair amount of confusion in the literature. More often, however, the term is used to refer to the entire process of finding and using interesting patterns in data.

Data mining techniques were first applied to databases. A better term for this process is KDD (Knowledge Discovery in Databases). Benoît (2002) offers this definition of KDD (which he refers to as data mining): Data mining (DM) is a multistage process of extracting previously unanticipated knowledge from large databases, and applying the results to decision making. Data mining tools detect patterns from the data and infer associations and rules from them. The extracted information may then be applied to prediction or classification models by identifying relations within the data records or between databases. Those patterns and rules can then guide decision making and forecast the effects of those decisions.

Data mining techniques can be applied to a wide variety of data repositories including databases, data warehouses, spatial data, multimedia data, Internet or Web-based data and complex objects. A more appropriate term for describing the entire process would be knowledge discovery, but unfortunately the term data mining is what has caught on.

The following figure shows data mining as a step in an iterative knowledge discovery process.

[Figure: the knowledge discovery process — Database → Data integration → Data cleaning → Data warehouse → Selection and transformation → Task-relevant data → Data mining → Pattern evaluation → Knowledge]
Fig. 3.1 Data mining is the core of knowledge discovery process (Source: http://www.exinfm.com/pdffiles/intro_dm.pdf)

The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:
• Data cleaning: Also known as data cleansing, this is a phase in which noisy data and irrelevant data are removed from the collection.
• Data integration: At this stage, multiple data sources, often heterogeneous, may be combined in a common source.
• Data selection: At this step, the data relevant to the analysis is decided on and retrieved from the data collection.


• Data transformation: Also known as data consolidation, this is a phase in which the selected data is transformed into forms appropriate for the mining procedure.
• Data mining: This is the crucial step in which clever techniques are applied to extract potentially useful patterns.
• Pattern evaluation: In this step, strictly interesting patterns representing knowledge are identified based on given measures.
• Knowledge representation: This is the final phase in which the discovered knowledge is visually represented to the user. This essential step uses visualisation techniques to help users understand and interpret the data mining results.

It is common to combine some of these steps together. For instance, data cleaning and data integration can be performed together as a pre-processing phase to generate a data warehouse. Data selection and data transformation can also be combined where the consolidation of the data is the result of the selection, or, as for the case of data warehouses, the selection is done on transformed data.

The KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.

Data mining derives its name from the similarities between searching for valuable information in a large database and mining rocks for a vein of valuable ore. Both imply either sifting through a large amount of material or ingeniously probing the material to pinpoint exactly where the values reside. It is, however, a misnomer, since mining for gold in rocks is usually called “gold mining” and not “rock mining”; by analogy, data mining should have been called “knowledge mining” instead. Nevertheless, data mining became the accepted customary term, and it very rapidly became a trend that even overshadowed more general terms such as knowledge discovery in databases (KDD) that describe a more complete process. Other similar terms referring to data mining are data dredging, knowledge extraction and pattern discovery.

The ongoing remarkable growth in the field of data mining and knowledge discovery has been fuelled by a fortunate confluence of a variety of factors:
• The explosive growth in data collection, as exemplified by the supermarket scanners above
• The storing of the data in data warehouses, so that the entire enterprise has access to a reliable current database
• The availability of increased access to data from Web navigation and intranets
• The competitive pressure to increase market share in a globalised economy
• The development of off-the-shelf commercial data mining software suites
• The tremendous growth in computing power and storage capacity

3.2 Crucial Concepts of Data Mining
Some crucial concepts of data mining are explained below.

3.2.1 Bagging (Voting, Averaging)
The concept of bagging (voting for classification, averaging for regression-type problems with continuous dependent variables of interest) applies to the area of predictive data mining, to combine the predicted classifications (prediction) from multiple models, or from the same type of model for different learning data. It is also used to address the inherent instability of results when applying complex models to relatively small data sets. Suppose your data mining task is to build a model for predictive classification, and the dataset from which to train the model (learning data set, which contains observed classifications) is relatively small. You could repeatedly sub-sample (with replacement) from the dataset, and apply, for example, a tree classifier (for example, C&RT and CHAID) to the successive samples. In practice, very different trees will often be grown for the different samples, illustrating the instability of models often evident with small data sets. One method of deriving a single prediction (for new observations) is to use all

trees found in the different samples, and to apply some simple voting. The final classification is the one most often predicted by the different trees. Note that some weighted combination of predictions (weighted vote, weighted average) is also possible, and commonly used. A sophisticated (machine learning) algorithm for generating weights for weighted prediction or voting is the Boosting procedure.
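As a minimal sketch of bagging by simple voting, the snippet below trains one decision tree per bootstrap sample and lets the trees vote; scikit-learn is assumed, a decision tree stands in for the C&RT/CHAID classifiers mentioned above, and the data are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # small, illustrative learning data set
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # observed classifications

def bagged_predict(X_train, y_train, X_new, n_models=25):
    """Train one tree per bootstrap sample and combine predictions by voting."""
    votes = np.zeros((n_models, len(X_new)), dtype=int)
    for m in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))   # sample with replacement
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes[m] = tree.predict(X_new)
    # The final classification is the one most often predicted by the different trees.
    return (votes.mean(axis=0) >= 0.5).astype(int)

print(bagged_predict(X, y, X[:10]))
```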

3.2.2 Boosting
The concept of boosting applies to the area of predictive data mining, to generate multiple models or classifiers (for prediction or classification), and to derive weights to combine the predictions from those models into a single prediction or predicted classification.

A simple algorithm for boosting works like this: Start by applying some method (for example, a tree classifier such as C&RT or CHAID) to the learning data, where each observation is assigned an equal weight. Compute the predicted classifications, and apply weights to the observations in the learning sample that are inversely proportional to the accuracy of the classification. In other words, assign greater weight to those observations that were difficult to classify (where the misclassification rate was high), and lower weights to those that were easy to classify (where the misclassification rate was low). In the context of C&RT for example, different misclassification costs (for the different classes) can be applied, inversely proportional to the accuracy of prediction in each class. Then apply the classifier again to the weighted data (or with different misclassification costs), and continue with the next iteration (application of the analysis method for classification to the re-weighted data).

Boosting will generate a sequence of classifiers, where each consecutive classifier in the sequence is an “expert” in classifying observations that were not well classified by those preceding it. During deployment (for prediction or classification of new cases), the predictions from the different classifiers can then be combined (for example, via voting or some weighted voting procedure) to derive a single best prediction or classification.

Note that boosting can also be applied to learning methods that do not explicitly support weights or misclassification costs. In that case, random sub-sampling can be applied to the learning data in the successive steps of the iterative boosting procedure, where the probability for selection of an observation into the subsample is inversely proportional to the accuracy of the prediction for that observation in the previous iteration (in the sequence of iterations of the boosting procedure).
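The following is a simplified, illustrative sketch of the re-weighting idea described above rather than a full AdaBoost implementation: the weights of misclassified observations are doubled after each iteration, and the resulting sequence of classifiers is combined by a simple vote. scikit-learn decision stumps and synthetic data are assumed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)       # illustrative learning data

weights = np.full(len(X), 1.0 / len(X))       # each observation starts with an equal weight
classifiers = []
for _ in range(10):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    misclassified = stump.predict(X) != y
    # Assign greater weight to observations that were difficult to classify.
    weights[misclassified] *= 2.0
    weights /= weights.sum()
    classifiers.append(stump)

# Combine the sequence of classifiers by simple voting.
votes = np.mean([clf.predict(X) for clf in classifiers], axis=0)
predictions = (votes >= 0.5).astype(int)
print("training accuracy:", (predictions == y).mean())
```

A weighted vote, with weights derived from each classifier's accuracy, is what full boosting procedures use in place of the plain vote shown here.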

3.2.3 Data Preparation (in Data Mining)
Data preparation and cleaning is an often neglected but extremely important step in the data mining process. The old saying “garbage-in-garbage-out” is particularly applicable to the typical data mining projects where large data sets collected via some automatic methods (for example, via the Web) serve as the input into the analyses. Often, the method by which the data were gathered was not tightly controlled, and so the data may contain out-of-range values (for example, Income: -100), impossible data combinations (for example, Gender: Male, Pregnant: Yes), and the like. Analysing data that has not been carefully screened for such problems can produce highly misleading results, in particular in predictive data mining.
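A minimal pandas sketch of this kind of screening on a hypothetical data set (the column names and validation rules are assumptions, not part of the text) might look as follows.

```python
import pandas as pd

records = pd.DataFrame({
    "income": [52000, -100, 71000],            # -100 is an out-of-range value
    "gender": ["F", "M", "M"],
    "pregnant": ["No", "Yes", "No"],           # Male + Pregnant is an impossible combination
})

out_of_range = records[records["income"] < 0]
impossible = records[(records["gender"] == "M") & (records["pregnant"] == "Yes")]

print(out_of_range)
print(impossible)
```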

3.2.4 Data Reduction (for Data Mining)
The term Data Reduction in the context of data mining is usually applied to projects where the goal is to aggregate or amalgamate the information contained in large datasets into manageable (smaller) information nuggets. Data reduction methods can include simple tabulation, aggregation (computing descriptive statistics) or more sophisticated techniques like clustering, principal components analysis, and so on.
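As an illustration of both ends of this spectrum, the sketch below first aggregates a synthetic data set with simple descriptive statistics and then applies principal components analysis; pandas and scikit-learn are assumed, and the variable names are invented.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
data = pd.DataFrame(rng.normal(size=(500, 6)),
                    columns=[f"x{i}" for i in range(6)])
data["region"] = rng.choice(["north", "south"], size=500)

# Simple tabulation/aggregation: descriptive statistics per region.
summary = data.groupby("region").agg(["mean", "std"])

# More sophisticated reduction: project the six measures onto two principal components.
components = PCA(n_components=2).fit_transform(data.drop(columns="region"))
print(summary.shape, components.shape)
```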

3.2.5 Deployment
The concept of deployment in predictive data mining refers to the application of a model for prediction or classification to new data. After a satisfactory model or set of models has been identified (trained) for a particular application, we usually want to deploy those models so that predictions or predicted classifications can quickly be obtained for new data. For example, a credit card company may want to deploy a trained model or set of models (for example, neural networks, meta-learner) to quickly identify transactions which have a high probability of being fraudulent.


3.2.6 Drill-Down Analysis
The concept of drill-down analysis applies to the area of data mining, to denote the interactive exploration of data, in particular of large databases. The process of drill-down analysis begins by considering some simple break-downs of the data by a few variables of interest (for example, gender, geographic region, and so on). Various statistics, tables, histograms, and other graphical summaries can be computed for each group. Next, we may want to “drill down” to expose and further analyse the data “underneath” one of the categorisations; for example, we might want to further review the data for males from the mid-west. Again, various statistical and graphical summaries can be computed for those cases only, which might suggest further break-downs by other variables (for example, income, age, and so on). At the lowest (“bottom”) level are the raw data: for example, you may want to review the addresses of male customers from one region, for a certain income group, and so on, and to offer to those customers some particular services of particular utility to that group.
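A minimal sketch of drill-down with pandas on a hypothetical customer table: a top-level break-down by gender and region, followed by a further break-down of one category by an assumed income banding. The column names, categories and income bands are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
customers = pd.DataFrame({
    "gender": rng.choice(["male", "female"], size=1000),
    "region": rng.choice(["mid-west", "east", "west"], size=1000),
    "income": rng.normal(60000, 15000, size=1000),
    "spend": rng.gamma(2.0, 500.0, size=1000),
})

# Top level: simple break-down by a few variables of interest.
top = customers.groupby(["gender", "region"])["spend"].agg(["count", "mean"])

# Drill down: males from the mid-west, further broken down by income group.
subset = customers[(customers["gender"] == "male") & (customers["region"] == "mid-west")]
income_band = pd.cut(subset["income"], bins=[0, 50000, 75000, np.inf],
                     labels=["low", "middle", "high"])
drill = subset.groupby(income_band)["spend"].mean()

print(top.head())
print(drill)
```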

3.2.7 Feature Selection
One of the preliminary stages in predictive data mining, when the data set includes more variables than could be included (or would be efficient to include) in the actual model building phase (or even in initial exploratory operations), is to select predictors from a large list of candidates. For example, when data are collected via automated (computerised) methods, it is not uncommon that measurements are recorded for thousands or hundreds of thousands (or more) of predictors. The standard analytic methods for predictive data mining, such as neural network analyses, classification and regression trees, generalised linear models, or general linear models become impractical when the number of predictors exceeds a few hundred variables.

Feature selection selects a subset of predictors from a large list of candidate predictors without assuming that the relationships between the predictors and the dependent or outcome variables of interest are linear, or even monotone. Therefore, this is used as a pre-processor for predictive data mining, to select manageable sets of predictors that are likely related to the dependent (outcome) variables of interest, for further analyses with any of the other methods for regression and classification.
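A minimal sketch of such screening, assuming scikit-learn: mutual information makes no linearity or monotonicity assumption about the predictor–outcome relationship, and SelectKBest keeps a manageable subset of candidate predictors for later modelling. The data are synthetic.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 200))                     # many candidate predictors
y = (X[:, 0] ** 2 + X[:, 1] > 1).astype(int)        # outcome related non-linearly to a few of them

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print("kept predictor indices:", np.flatnonzero(selector.get_support()))
print("reduced shape:", X_reduced.shape)
```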

3.2.8 Machine Learning
Machine learning, computational learning theory and similar terms are often used in the context of data mining, to denote the application of generic model-fitting or classification algorithms for predictive data mining. Unlike traditional statistical data analysis, which is usually concerned with the estimation of population parameters by statistical inference, the emphasis in data mining (and machine learning) is usually on the accuracy of prediction (predicted classification), regardless of whether or not the “models” or techniques that are used to generate the predictions are interpretable or open to simple explanation. Good examples of this type of technique often applied to predictive data mining are neural networks or meta-learning techniques such as boosting, and so on. These methods usually involve the fitting of very complex “generic” models that are not related to any reasoning or theoretical understanding of underlying causal processes; instead, these techniques can be shown to generate accurate predictions or classifications in cross-validation samples.

3.2.9 Meta-Learning
The concept of meta-learning applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different. In this context, this procedure is also referred to as Stacking (Stacked Generalisation).
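A minimal stacking sketch, assuming scikit-learn's StackingClassifier and synthetic data: two quite different base models are combined through a simple meta-learner fitted on their predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

stacker = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),    # meta-learner combining the base predictions
)
stacker.fit(X, y)
print("training accuracy:", stacker.score(X, y))
```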

3.2.10 Models for Data Mining
In the business environment, complex data mining projects may require the coordinated efforts of various experts, stakeholders, or departments throughout an entire organisation. In the data mining literature, various “general frameworks” have been proposed to serve as blueprints for how to organise the process of gathering data, analysing data, disseminating results, implementing results, and monitoring improvements.

One such model, CRISP (Cross-Industry Standard Process for data mining) was proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining. This general approach postulates the following (perhaps not particularly controversial) general sequence of steps for data mining projects:

[Figure: Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment]
Fig. 3.2 Steps for data mining projects

Another approach, the Six Sigma methodology, is a well-structured, data-driven methodology for eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery, management, and other business activities. This model has recently become very popular (due to its successful implementations) in various American industries, and it appears to gain favour worldwide. It postulates a sequence of so-called DMAIC steps that grew out of the manufacturing, quality improvement, and process control traditions and is particularly well suited to production environments (including “production of services,” that is, service industries).

[Figure: Define → Measure → Analyse → Improve → Control]
Fig. 3.3 Six-sigma methodology

Another framework of this kind (actually somewhat similar to Six Sigma) is the approach proposed by the SAS Institute called SEMMA, which focuses more on the technical activities typically involved in a data mining project.

[Figure: Sample → Explore → Modify → Model → Assess]
Fig. 3.4 SEMMA

All of these models are concerned with the process of how to integrate data mining methodology into an organisation, how to “convert data into information,” how to involve important stake-holders, and how to disseminate the information in a form that can easily be converted by stake-holders into resources for strategic decision making.

Some software tools for data mining are specifically designed and documented to fit into one of these specific frameworks.


3.2.11 Predictive Data Mining
The term Predictive Data Mining is usually applied to data mining projects whose goal is to identify a statistical or neural network model or set of models that can be used to predict some response of interest. For example, a credit card company may want to engage in predictive data mining, to derive a (trained) model or set of models (for example, neural networks, meta-learner) that can quickly identify transactions which have a high probability of being fraudulent. Other types of data mining projects may be more exploratory in nature (for example, to identify clusters or segments of customers), in which case drill-down descriptive and exploratory methods would be applied. Data reduction is another possible objective for data mining.

3.2.12 Text Mining
While data mining is typically concerned with the detection of patterns in numeric data, very often important (for example, critical to business) information is stored in the form of text. Unlike numeric data, text is often amorphous and difficult to deal with. Text mining generally consists of the analysis of (multiple) text documents by extracting key phrases, concepts, and so on, and the preparation of the text processed in that manner for further analyses with numeric data mining techniques (for example, to determine co-occurrences of concepts, key phrases, names, addresses, product names, and so on).
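As a minimal sketch of this first step, the snippet below turns a few invented documents into a numeric term-count matrix that ordinary data mining techniques can consume; a recent version of scikit-learn is assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Customer praised the new product and the quick delivery",
    "Delivery was slow and the product arrived damaged",
    "Quick response from support, customer satisfied with product",
]

vectorizer = CountVectorizer(stop_words="english")
term_matrix = vectorizer.fit_transform(documents)   # each document becomes a row of term counts

print(vectorizer.get_feature_names_out())
print(term_matrix.toarray())
```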

3.3 Cross-Industry Standard Process: CRISP–DM
There is a temptation in some companies, due to departmental inertia and compartmentalisation, to approach data mining haphazardly, to reinvent the wheel and duplicate effort. A cross-industry standard was clearly required that is industry-neutral, tool-neutral, and application-neutral. The Cross-Industry Standard Process for Data Mining (CRISP–DM) was developed in 1996 by analysts representing DaimlerChrysler, SPSS, and NCR. CRISP provides a non-proprietary and freely available standard process for fitting data mining into the general problem-solving strategy of a business or research unit.

According to CRISP–DM, a given data mining project has a life cycle consisting of six phases, as illustrated in fig. 3.5. Note that the phase sequence is adaptive. That is, the next phase in the sequence often depends on the outcomes associated with the preceding phase. The most significant dependencies between phases are indicated by the arrows. For example, suppose that we are in the modelling phase. Depending on the behaviour and characteristics of the model, we may have to return to the data preparation phase for further refinement before moving forward to the model evaluation phase.

The iterative nature of CRISP is symbolised by the outer circle in the figure 3.5. Often, the solution to a particular business or research problem leads to further questions of interest, which may then be attacked using the same general process as before.

[Figure: the CRISP–DM life cycle — Business/Research Understanding phase, Data Understanding phase, Data Preparation phase, Modeling phase, Evaluation phase and Deployment phase, arranged as an iterative cycle]
Fig. 3.5 CRISP–DM is an iterative, adaptive process

Following is an outline of each phase. Although conceivably, issues encountered during the evaluation phase can send the analyst back to any of the previous phases for amelioration, for simplicity we show only the most common loop, back to the modelling phase.

3.3.1 CRISP-DM: The Six Phases The six phases of CRISP-DM are explained below:

Business understanding phase: The first phase in the CRISP–DM standard process may also be termed the research understanding phase.
• Enunciate the project objectives and requirements clearly in terms of the business or research unit as a whole.
• Translate these goals and restrictions into the formulation of a data mining problem definition.
• Prepare a preliminary strategy for achieving these objectives.


Data understanding phase:
• Collect the data.
• Use exploratory data analysis to familiarise yourself with the data and discover initial insights.
• Evaluate the quality of the data.
• If desired, select interesting subsets that may contain actionable patterns.

Data preparation phase:
• Prepare from the initial raw data the final data set that is to be used for all subsequent phases. This phase is very labour intensive.
• Select the cases and variables you want to analyse and that are appropriate for your analysis.
• Perform transformations on certain variables, if needed.
• Clean the raw data so that it is ready for the modelling tools.

Modelling phase:
• Select and apply appropriate modelling techniques.
• Calibrate model settings to optimise results.
• Remember that often, several different techniques may be used for the same data mining problem.
• If necessary, loop back to the data preparation phase to bring the form of the data into line with the specific requirements of a particular data mining technique.

Evaluation phase:
• Evaluate the one or more models delivered in the modelling phase for quality and effectiveness before deploying them for use in the field.
• Determine whether the model in fact achieves the objectives set for it in the first phase.
• Establish whether some important facet of the business or research problem has not been accounted for sufficiently.
• Come to a decision regarding use of the data mining results.

Deployment phase:
• Make use of the models created: model creation does not signify the completion of a project.
• Example of a simple deployment: generate a report.
• Example of a more complex deployment: implement a parallel data mining process in another department.
• For businesses, the customer often carries out the deployment based on your model.

Table 3.1 The six phases of CRISP-DM

3.4 Data Mining Techniques
As a general data structure, graphs have become increasingly important in modelling sophisticated structures and their interactions, with broad applications including chemical informatics, bioinformatics, computer vision, video indexing, text retrieval, and Web analysis. Mining frequent subgraph patterns for further characterisation, discrimination, classification, and cluster analysis becomes an important task. Moreover, graphs that link many nodes together may form different kinds of networks, such as telecommunication networks, computer networks, biological networks, and Web and social community networks.

As such networks have been studied extensively in the context of social networks, their analysis has often been referred to as social network analysis. Furthermore, in a relational database, objects are semantically linked across multiple relations. Mining in a relational database often requires mining across multiple interconnected relations, which is similar to mining in connected graphs or networks. Such mining across data relations is considered multirelational data mining. Data mining techniques can be classified as shown in the following diagram.

[Figure: Data mining techniques — Graph Mining, Social Network Analysis, Multirelational Data Mining]
Fig. 3.6 Data mining techniques

3.5 Graph Mining
Graphs become increasingly important in modelling complicated structures, such as circuits, images, chemical compounds, protein structures, biological networks, social networks, the Web, workflows, and XML documents. Many graph search algorithms have been developed in chemical informatics, computer vision, video indexing, and text retrieval. With the increasing demand on the analysis of large amounts of structured data, graph mining has become


an active and important theme in data mining. Among the various kinds of graph patterns, frequent substructures are the very basic patterns that can be discovered in a collection of graphs. They are useful for characterising graph sets, discriminating different groups of graphs, classifying and clustering graphs, building graph indices, and facilitating similarity search in graph databases.

Recent studies have developed several graph mining methods and applied them to the discovery of interesting patterns in various applications. For example, there are reports on the discovery of active chemical structures in HIV-screening datasets by contrasting the support of frequent graphs between different classes. There have been studies on the use of frequent structures as features to classify chemical compounds, on the frequent graph mining technique to study protein structural families, on the detection of considerably large frequent subpathways in metabolic networks, and on the use of frequent graph patterns for graph indexing and similarity search in graph databases. Although graph mining may include mining frequent subgraph patterns, graph classification, clustering, and other analysis tasks, in this section we focus on mining frequent subgraphs. Following figure explains the methods for Mining Frequent Subgraphs.

[Figure: Methods for mining frequent subgraphs — Apriori-based approach and pattern-growth approach]
Fig. 3.7 Methods of mining frequent subgraphs

3.6 Social Network Analysis
From the point of view of data mining, a social network is a heterogeneous and multirelational data set represented by a graph. The graph is typically very large, with nodes corresponding to objects and edges corresponding to links representing relationships or interactions between objects. Both nodes and links have attributes. Objects may have class labels. Links can be one-directional and are not required to be binary. Social networks need not be social in context. There are many real-world instances of technological, business, economic, and biological social networks.

Examples include electrical power grids, telephone call graphs, the spread of computer viruses, the World Wide Web, and co-authorship and citation networks of scientists. Customer networks and collaborative filtering problems (where product recommendations are based on the preferences of other customers) are other examples. In biology, examples range from epidemiological networks, cellular and metabolic networks, and food webs, to the neural network of the nematode worm Caenorhabditis elegans (the only creature whose neural network has been completely mapped). The exchange of e-mail messages within corporations, newsgroups, chat rooms, friendships, sex webs (linking sexual partners), and the quintessential “old-boy” network (that is the overlapping boards of directors of the largest companies in the United States) are examples from sociology.

3.6.1 Characteristics of Social Networks
Social networks are rarely static. Their graph representations evolve as nodes and edges are added or deleted over time. In general, social networks tend to exhibit the following characteristics:

56/JNU OLE • Densification power law: Previously, it was believed that as a network evolves, the number of degrees grows linearly in the number of nodes. This was known as the constant average degree assumption. However, extensive experiments have shown that, on the contrary, networks become increasingly dense over time with the average degree increasing (and hence, the number of edges growing super linearly in the number of nodes). The densification follows the densification power law (or growth power law), which states,

• where e(t) and n(t), respectively, represent the number of edges and nodes of the graph at time t, and the exponent a generally lies strictly between 1 and 2. Note that if a = 1, this corresponds to constant average degree over time, whereas a = 2 corresponds to an extremely dense graph where each node has edges to a constant fraction of all nodes. • Shrinking diameter: It has been experimentally shown that the effective diameter tends to decrease as the network grows. This contradicts an earlier belief that the diameter slowly increases as a function of network size decreases. As an intuitive example, consider a citation network, where nodes are papers and a citation from one paper to another is indicated by a directed edge. The out-links of a node, v (representing the papers cited by v), are “frozen” at the moment it joins the graph. The decreasing distances between pairs of nodes consequently appears to be the result of subsequent papers acting as “bridges” by citing earlier papers from other areas. • Heavy-tailed out-degree and in-degree distributions: The number of out-degrees for a node tends to follow a heavy-tailed distribution by observing the power law, 1/na, where n is the rank of the node in the order of decreasing out-degrees and typically, 0 < a < 2. The smaller the value of a, the heavier the tail. This phenomena is represented in the preferential attachment model, where each new node attaches to an existing network by a constant number of out-links, following a “rich-get-richer” rule. The in-degrees also follow a heavy-tailed distribution, although it tends be more skewed than the out-degrees distribution. Node out-degress

[Figure: node out-degrees (y-axis) plotted against node rank (x-axis)]
Fig. 3.8 Heavy-tailed out-degree and in-degree distributions

The number of out-degrees (y-axis) for a node tends to follow a heavy-tailed distribution. The node rank (x-axis) is defined as the order of decreasing out-degrees of the node.

3.6.2 Mining on Social Networks
The following are exemplar areas of mining on social networks, namely, link prediction, mining customer networks for viral marketing, mining newsgroups using networks, and community mining from multirelational networks.

Link prediction: What edges will be added to the network? Approaches to link prediction have been proposed based on several measures for analysing the “proximity” of nodes in a network. Many measures originate from techniques in graph theory and social network analysis. The general methodology is as follows: All methods assign a connection weight, score(X, Y), to pairs of nodes, X and Y, based


on the given proximity measure and input graph, G. A ranked list in decreasing order of score(X, Y) is produced. This gives the predicted new links in decreasing order of confidence. The predictions can be evaluated based on real observations on experimental data sets. The simplest approach ranks pairs, (X, Y), by the length of their shortest path in G. This embodies the small world notion that all individuals are linked through short chains. (Since the convention is to rank all pairs in order of decreasing score, score(X, Y) is here defined as the negative of the shortest path length.) Several measures use neighbourhood information. The simplest such measure is common neighbours—the greater the number of neighbours that X and Y have in common, the more likely X and Y are to form a link in the future. Intuitively, if authors X and Y have never written a paper together but have many colleagues in common, they are more likely to collaborate in the future. Other measures are based on the ensemble of all paths between two nodes. The Katz measure, for example, computes a weighted sum over all paths between X and Y, where shorter paths are assigned heavier weights. All of the measures can be used in conjunction with higher-level approaches, such as clustering. For instance, the link prediction method can be applied to a cleaned-up version of the graph, in which spurious edges have been removed.
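A minimal sketch of the common-neighbours score on a small, invented undirected graph, in plain Python: pairs that are not yet linked are ranked by decreasing score, mirroring the general methodology described above.

```python
from itertools import combinations

# Hypothetical undirected graph as an adjacency dictionary (purely illustrative).
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D", "E"},
    "D": {"A", "C"},
    "E": {"C"},
}

def common_neighbours(x, y):
    """score(X, Y): number of neighbours that X and Y have in common."""
    return len(graph[x] & graph[y])

# Score only node pairs that are not already linked, then rank by decreasing score.
candidates = [(x, y) for x, y in combinations(graph, 2) if y not in graph[x]]
ranked = sorted(candidates, key=lambda pair: common_neighbours(*pair), reverse=True)

for x, y in ranked:
    print(x, y, common_neighbours(x, y))
```

The Katz measure mentioned above would replace common_neighbours with a weighted sum over all paths between the two nodes, but the ranking machinery stays the same.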

Mining customer networks for viral marketing
Viral marketing is an application of social network mining that explores how individuals can influence the buying behaviour of others. Traditionally, companies have employed direct marketing (where the decision to market to a particular individual is based solely on that individual’s characteristics) or mass marketing (where individuals are targeted based on the population segment to which they belong). These approaches, however, neglect the influence that customers can have on the purchasing decisions of others.

For example, consider a person who decides to see a particular movie and persuades a group of friends to see the same film. Viral marketing aims to optimise the positive word-of-mouth effect among customers. It can choose to spend more money marketing to an individual, if that person has many social connections. Thus, by considering the interactions between customers, viral marketing may obtain higher profits than traditional marketing, which ignores such interactions.

The growth of the Internet over the past two decades has led to the availability of many social networks that can be mined for the purposes of viral marketing. Examples include e-mail mailing lists, UseNet groups, on-line forums, Internet Relay Chat (IRC), instant messaging, collaborative filtering systems, and knowledge-sharing sites. Knowledge sharing sites allow users to offer advice or rate products to help others, typically for free. Users can rate the usefulness or “trustworthiness” of a review, and may possibly rate other reviewers as well. In this way, a network of trust relationships between users (known as a “web of trust”) evolves, representing a social network for mining.

Mining newsgroups using networks
The situation is rather different in newsgroups on topic discussions. A typical newsgroup posting consists of one or more quoted lines from another posting followed by the opinion of the author. Such quoted responses form “quotation links” and create a network in which the vertices represent individuals and the links “responded-to” relationships.

An interesting phenomenon is that people more frequently respond to a message when they disagree than when they agree. This behaviour exists in many newsgroups and is in sharp contrast to the Web page link graph, where linkage is an indicator of agreement or common interest. Based on this behaviour, one can effectively classify and partition authors in the newsgroup into opposite camps by analysing the graph structure of the responses.

This newsgroup classification process can be performed using a graph-theoretic approach. The quotation network (or graph) can be constructed by building a quotation link between person i and person j if i has quoted from an earlier posting written by j. We can consider any bipartition of the vertices into two sets: F represents those for an issue and A represents those against it. If most edges in a newsgroup graph represent disagreements, then the optimum choice is to maximise the number of edges across these two sets. Because it is known that theoretically the max-cut problem (that is, maximising the number of edges to cut so that a graph is partitioned into two disconnected subgraphs) is an NP-hard problem, we need to explore some alternative, practical solutions.

58/JNU OLE In particular, we can exploit two additional facts that hold in our situation: (1) rather than being a general graph, our instance is largely a bipartite graph with some noise edges added, and (2) neither side of the bipartite graph is much smaller than the other. In such situations, we can transform the problem into a minimum-weight, approximately balanced cut problem, which in turn can be well approximated by computationally simple spectral methods. Moreover, to further enhance the classification accuracy, we can first manually categorise a small number of prolific posters and tag the corresponding vertices in the graph. This information can then be used to bootstrap a better overall partitioning by enforcing the constraint that those classified on one side by human effort should remain on that side during the algorithmic partitioning of the graph.

Based on these ideas, an efficient algorithm was proposed. Experiments with some newsgroup data sets on several highly debatable social topics, such as abortion, gun control, and immigration, demonstrate that links carry less noisy information than text. Methods based on linguistic and statistical analysis of text yield lower accuracy on such newsgroup data sets than that based on the link analysis shown earlier. This is because the vocabulary used by the opponent sides tends to be largely identical, and many newsgroup postings consist of too-brief text to facilitate reliable linguistic analysis.

Community mining from multirelational networks
With the growth of the Web, community mining has attracted increasing attention. A great deal of such work has focused on mining implicit communities of Web pages, of scientific literature from the Web, and of document citations. In principle, a community can be defined as a group of objects sharing some common properties. Community mining can be thought of as subgraph identification.

For example, in Web page linkage, two Web pages (objects) are related if there is a hyperlink between them. A graph of Web page linkages can be mined to identify a community or set of Web pages on a particular topic.

Most techniques for graph mining and community mining are based on a homogenous graph, that is, they assume that only one kind of relationship exists between the objects. However, in real social networks, there are always various kinds of relationships between the objects. Each relation can be viewed as a relation network. In this sense, the multiple relations form a multirelational social network (also referred to as a heterogeneous social network). Each kind of relation may play a distinct role in a particular task. Here, the different relation graphs can provide us with different communities. To find a community with certain properties, it is necessary to identify which relation plays an important role in such a community. Such a relation might not exist explicitly; that is, we may need to first discover such a hidden relation before finding the community on such a relation network.

Different users may be interested in different relations within a network. Thus, if we mine networks by assuming only one kind of relation, we may end up missing out on a lot of valuable hidden community information, and such mining may not be adaptable to the diverse information needs of various users. This brings us to the problem of multirelational community mining, which involves the mining of hidden communities on heterogeneous social networks.

3.7 Multirelational Data Mining
Multirelational data mining (MRDM) methods search for patterns that involve multiple tables (relations) from a relational database. Consider the multirelational schema in the figure below, which defines a financial database. Each table or relation represents an entity or a relationship, described by a set of attributes. Links between relations show the relationships between them. One way to apply traditional data mining methods (which assume that the data reside in a single table) is propositionalisation, which converts multiple relational data into a single flat data relation using joins and aggregations. This, however, could lead to the generation of a huge, undesirable “universal relation” (involving all of the attributes). Furthermore, it can result in the loss of information, including essential semantic information represented by the links in the database design. Multirelational data mining aims to discover knowledge directly from relational data.
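As an illustration of propositionalisation, the following is a minimal sketch in Python with pandas. The loan and account tables, their columns, and the aggregates chosen are simplified, hypothetical stand-ins loosely inspired by the schema below; they are not an implementation of any particular MRDM system.

```python
# A minimal sketch of propositionalisation: flattening two related tables
# (hypothetical "loan" and "account" relations) into a single table by
# joining on the shared key and aggregating the one-to-many side.
import pandas as pd

loan = pd.DataFrame({
    "loan_id": [1, 2, 3],
    "account_id": [10, 10, 20],
    "amount": [5000, 1200, 800],
})
account = pd.DataFrame({
    "account_id": [10, 20],
    "frequency": ["monthly", "weekly"],
})

# Aggregate loans per account (count and total amount), then join back
# to the account relation to obtain one flat row per account.
loan_summary = (loan.groupby("account_id")["amount"]
                    .agg(loan_count="count", loan_total="sum")
                    .reset_index())
flat = account.merge(loan_summary, on="account_id", how="left")
print(flat)
```

The resulting flat table can then be fed to a conventional single-table learner, at the cost of some of the semantic information carried by the original links.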


[Figure: a financial multirelational schema with the relations Loan, Account, District, Card, Disposition, Transaction, Order, and Client, linked through keys such as account-id, district-id, disp-id, and client-id.]

Fig. 3.9 A financial multirelational schema

There are different multirelational data mining tasks, including multirelational classification, clustering, and frequent pattern mining. Multirelational classification aims to build a classification model that utilises information in different relations. Multirelational clustering aims to group tuples into clusters using their own attributes as well as tuples related to them in different relations. Multirelational frequent pattern mining aims at finding patterns involving interconnected items in different relations.

In a database for multirelational classification, there is one target relation, Rt , whose tuples are called target tuples and are associated with class labels. The other relations are non-target relations. Each relation may have one primary key (which uniquely identifies tuples in the relation) and several foreign keys (where a primary key in one relation can be linked to the foreign key in another). If we assume a two-class problem, then we pick one class as the positive class and the other as the negative class. The most important task for building an accurate multirelational classifier is to find relevant features in different relations that help distinguish positive and negative target tuples.

3.8 Data Mining Algorithms and their Types
The data mining algorithm is the mechanism that creates mining models. To create a model, an algorithm first analyses a set of data, looking for specific patterns and trends. The algorithm then uses the results of this analysis to define the parameters of the mining model.

The mining model that an algorithm creates can take various forms, including:
• A set of rules that describe how products are grouped together in a transaction.
• A decision tree that predicts whether a particular customer will buy a product.
• A mathematical model that forecasts sales.
• A set of clusters that describe how the cases in a dataset are related.

Types of Data Mining Algorithms
There are many different data mining algorithms. However, the three main types are classification, clustering, and association rules.

3.8.1 Classification
Classification is the task of generalising known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or as spam.

Preparing data for classification: The following pre-processing steps may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification process.

Data cleaning: This refers to the pre-processing of data in order to remove or reduce noise (by applying smoothing techniques, for example) and to treat missing values (for example, by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics). Although most classification algorithms have some mechanism for handling noisy or missing data, this step can help reduce confusion during learning.

Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether any two given attributes are statistically related. For example, a strong correlation between attributes A1 and A2 would suggest that one of the two could be removed from further analysis. A database may also contain irrelevant attributes. Attribute subset selection can be used in these cases to find a reduced set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Hence, relevance analysis, in the form of correlation analysis and attribute subset selection, can be used to detect attributes that do not contribute to the classification or prediction task. Including such attributes may otherwise slow down, and possibly mislead, the learning step.

Ideally, the time spent on relevance analysis, when added to the time spent on learning from the resulting “reduced” attribute (or feature) subset, should be less than the time that would have been spent on learning from the original set of attributes. Hence, such analysis can help to improve classification efficiency and scalability.
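As a rough sketch of correlation-based relevance analysis, the snippet below drops one attribute from any pair of numeric attributes whose absolute correlation exceeds an illustrative threshold; the column names, data, and cut-off of 0.95 are assumptions made for the example.

```python
# A small sketch of relevance analysis via correlation: if two numeric
# attributes are very strongly correlated, one of them is a candidate
# for removal before learning.
import pandas as pd

df = pd.DataFrame({
    "A1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "A2": [2.1, 4.0, 6.2, 8.1, 9.9],   # nearly 2 * A1
    "A3": [5.0, 3.0, 6.0, 2.0, 7.0],
})

corr = df.corr().abs()
to_drop = set()
cols = list(df.columns)
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if corr.iloc[i, j] > 0.95:       # illustrative cut-off
            to_drop.add(cols[j])         # keep the first, drop the second

reduced = df.drop(columns=sorted(to_drop))
print("dropped:", sorted(to_drop))
```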

Data transformation and reduction: The data may be transformed by normalisation, particularly when neural networks or methods involving distance measurements are used in the learning step. Normalisation involves scaling all values for a given attribute so that they fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0. In methods that use distance measurements, for example, this would prevent attributes with initially large ranges (for example, income) from outweighing attributes with initially smaller ranges (such as binary attributes). The data can also be transformed by generalising it to higher-level concepts. Concept hierarchies may be used for this purpose. This is particularly useful for continuous-valued attributes. For example, numeric values for the attribute income can be generalised to discrete ranges, such as low, medium, and high. Similarly, categorical attributes, like street, can be generalised to higher-level concepts, like city. Because generalisation compresses the original training data, fewer input/output operations may be involved during learning. Data can also be reduced by applying many other methods, ranging from wavelet transformation and principal components analysis to discretisation techniques, such as binning, histogram analysis, and clustering.
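The following short sketch illustrates two of the transformations just described, min-max normalisation and binning into low/medium/high ranges; the income values and bin edges are invented for the example.

```python
# A minimal sketch of min-max normalisation to [0.0, 1.0] and of
# concept-hierarchy style generalisation of a numeric attribute into
# "low" / "medium" / "high" bins. Bin edges are illustrative.
import numpy as np

income = np.array([18000.0, 32000.0, 45000.0, 90000.0, 120000.0])

# Min-max normalisation: (x - min) / (max - min)
normalised = (income - income.min()) / (income.max() - income.min())

# Discretisation via binning
bins = [0, 30000, 70000, np.inf]
labels = np.array(["low", "medium", "high"])
generalised = labels[np.digitize(income, bins) - 1]

print(normalised)
print(generalised)
```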

Bayesian classification: Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes’ theorem. Studies comparing classification algorithms have found a simple Bayesian classifier known as the naïve Bayesian classifier to be comparable in performance with decision tree and selected neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases. Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. It is made to simplify the computations involved and, in this sense, is considered “naïve.” Bayesian belief networks are graphical models which, unlike naïve Bayesian classifiers, allow the representation of dependencies among subsets of attributes. Bayesian belief networks can also be used for classification.

Bayes’ theorem
Bayes’ theorem is named after Thomas Bayes, a nonconformist English clergyman who performed early work in probability and decision theory during the 18th century. Let X be a data tuple. In Bayesian terms, X is considered “evidence.” As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the “evidence” or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X. P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.

For example, suppose our world of data tuples is confined to customers described by the attributes age and income, and that X is a 35-year-old customer with an income of $40,000. Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer’s age and income.

In contrast, P(H) is the prior probability, or a priori probability, of H. For the above example, this is the probability that any given customer will buy a computer, regardless of age, income, or any other information, for that matter. The posterior probability, P(H|X), is based on more information (for example, customer information) than the prior probability, P(H), which is independent of X.

Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer. P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000.

“How are these probabilities estimated?” P(H), P(X|H), and P(X) may be estimated from the given data, as we shall see below. Bayes’ theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X). Bayes’ theorem states that

P(H|X) = P(X|H) P(H) / P(X).
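For a concrete feel of the formula, here is a tiny numeric sketch using made-up probabilities for the customer example; the figures are purely illustrative.

```python
# A small numeric sketch of Bayes' theorem with invented numbers:
# H = "customer buys a computer", X = "customer is 35 and earns $40,000".
p_h = 0.3          # prior P(H): fraction of customers who buy a computer (assumed)
p_x_given_h = 0.2  # P(X|H): fraction of buyers who match X's profile (assumed)
p_x = 0.1          # P(X): fraction of all customers who match X's profile (assumed)

p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)   # 0.6, the posterior P(H|X)
```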

Naïve Bayesian classification
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
• Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, ..., An.
• Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if

P(Ci|X) > P(Cj|X)   for 1 ≤ j ≤ m, j ≠ i.

Thus, we maximise P(Ci|X). The class Ci for which P(Ci|X) is maximised is called the maximum posteriori hypothesis. By Bayes’ theorem, P(Ci|X) = P(X|Ci) P(Ci) / P(X).
• As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximised. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximise P(X|Ci). Otherwise, we maximise P(X|Ci) P(Ci). Note that the class prior probabilities may be estimated by P(Ci) = |Ci,D| / |D|, where |Ci,D| is the number of training tuples of class Ci in D.
• Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naïve assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple (that is, there are no dependence relationships among the attributes). Thus,

P(X|Ci) = ∏k=1..n P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci).

We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) from the training tuples. Recall that here xk refers to the value of attribute Ak for tuple X. For each attribute, we look at whether the attribute is categorical or continuous-valued. For instance, to compute P(xk|Ci), we consider the following:
• If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
• If Ak is continuous-valued, then we need to do a bit more work, but the calculation is straightforward. A continuous-valued attribute is typically assumed to have a Gaussian distribution with a mean μ and standard deviation σ, defined by

g(x, μ, σ) = (1 / (√(2π) σ)) exp(-(x - μ)² / (2σ²)),

so that P(xk|Ci) = g(xk, μCi, σCi).

We need only compute μCi and σCi, which are the mean and standard deviation, respectively, of the values of attribute Ak for training tuples of class Ci, and then plug these two quantities, together with xk, into the first equation above in order to estimate P(xk|Ci).
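To connect the estimates above to working code, here is a compact, self-contained Python sketch of a naïve Bayesian classifier for categorical attributes; the training tuples are invented, and Laplace smoothing (often used to avoid zero probabilities) is omitted for brevity.

```python
# A compact sketch of naive Bayesian classification on categorical data,
# following the estimates described above. Data and attribute values are
# illustrative, not taken from the text's examples.
from collections import Counter, defaultdict

# Each tuple: (age, student) -> class label buys_computer
train = [
    (("youth", "yes"), "yes"), (("youth", "no"), "no"),
    (("middle", "no"), "yes"), (("senior", "yes"), "yes"),
    (("senior", "no"), "no"),  (("middle", "yes"), "yes"),
]

class_counts = Counter(label for _, label in train)
# value_counts[(class, attribute index, value)] -> count
value_counts = defaultdict(int)
for x, label in train:
    for k, xk in enumerate(x):
        value_counts[(label, k, xk)] += 1

def predict(x):
    n = len(train)
    best_class, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                                   # P(Ci)
        for k, xk in enumerate(x):
            score *= value_counts[(c, k, xk)] / cc       # P(xk|Ci), no smoothing
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict(("youth", "yes")))   # predicted class label
```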

Rule-Based Classification
In rule-based classification, the learned model is represented as a set of IF-THEN rules.

Using IF-THEN rules for classification
Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form

IF condition THEN conclusion.

An example is rule R1:

R1: IF age = youth AND student = yes THEN buys_computer = yes.

The “IF”-part (or left-hand side) of a rule is known as the rule antecedent or precondition. The “THEN”-part (or right-hand side) is the rule consequent. In the rule antecedent, the condition consists of one or more attribute tests (such as age = youth and student = yes) that are logically ANDed. The rule’s consequent contains a class prediction (in this case, we are predicting whether a customer will buy a computer). R1 can also be written as

R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes).

If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a given tuple, we say that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the tuple.

A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labelled data set, D, let n_covers be the number of tuples covered by R, n_correct be the number of tuples correctly classified by R, and |D| be the number of tuples in D. We can define the coverage and accuracy of R as

coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers

That is, a rule’s coverage is the percentage of tuples that are covered by the rule (that is whose attribute values hold true for the rule’s antecedent). For a rule’s accuracy, we look at the tuples that it covers and see what percentage of them the rule can correctly classify.
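A quick sketch of these two measures for rule R1, computed over an invented four-tuple data set, is shown below.

```python
# Coverage and accuracy for rule
# R1: IF age = youth AND student = yes THEN buys_computer = yes,
# computed over a tiny illustrative data set.
D = [
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "youth",  "student": "yes", "buys_computer": "no"},
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
]

covered = [t for t in D if t["age"] == "youth" and t["student"] == "yes"]
correct = [t for t in covered if t["buys_computer"] == "yes"]

coverage = len(covered) / len(D)          # n_covers / |D|
accuracy = len(correct) / len(covered)    # n_correct / n_covers
print(coverage, accuracy)                 # 0.5 0.5
```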

Rule extraction from a Decision Tree
Decision tree classifiers are a popular method of classification: it is easy to understand how decision trees work, and they are known for their accuracy. Decision trees can, however, become large and difficult to interpret. In this subsection, we look at how to build a rule-based classifier by extracting IF-THEN rules from a decision tree. In comparison with a decision tree, the IF-THEN rules may be easier for humans to understand, particularly if the decision tree is very large. To extract rules from a decision tree, one rule is created for each path from the root to a leaf node. Each splitting criterion along a given path is logically ANDed to form the rule antecedent (“IF” part). The leaf node holds the class prediction, forming the rule consequent (“THEN” part).

Rule induction using a sequential covering algorithm
IF-THEN rules can be extracted directly from the training data (that is, without having to generate a decision tree first) using a sequential covering algorithm. The name comes from the notion that the rules are learned sequentially (one at a time), where each rule for a given class will ideally cover many of the tuples of that class (and hopefully none of the tuples of other classes). Sequential covering algorithms are the most widely used approach to mining disjunctive sets of classification rules, and form the topic of this subsection. Note that in a newer alternative approach, classification rules can be generated using associative classification algorithms, which search for attribute-value pairs that occur frequently in the data. These pairs may form association rules, which can be analysed and used in classification.

There are many sequential covering algorithms. Popular variations include AQ, CN2, and the more recent RIPPER. The general strategy is as follows:
• Rules are learned one at a time.
• Each time a rule is learned, the tuples covered by the rule are removed.
• The process repeats on the remaining tuples.

This sequential learning of rules is in contrast to decision tree induction. Because the path to each leaf in a decision tree corresponds to a rule, we can consider decision tree induction as learning a set of rules simultaneously.

A basic sequential covering algorithm is shown in the following figure. Here, rules are learned for one class at a time. Ideally, when learning a rule for a class, Ci, we would like the rule to cover all (or many) of the training tuples of class Ci and none (or few) of the tuples from other classes. In this way, the rules learned should be of high accuracy. The rules need not necessarily be of high coverage. This is because we can have more than one rule for a class, so that different rules may cover different tuples within the same class.

Algorithm: Sequential covering. Learn a set of IF-THEN rules for classification.

Input:
• D, a data set of class-labelled tuples;
• Att_vals, the set of all attributes and their possible values.

Output: A set of IF-THEN rules.

Method:

(1) Rule_set = { }; // initial set of rules learned is empty
(2) for each class c do
(3)     repeat
(4)         Rule = Learn_One_Rule(D, Att_vals, c);
(5)         remove tuples covered by Rule from D;
(6)     until terminating condition;
(7)     Rule_set = Rule_set + Rule; // add new rule to rule set
(8) endfor
(9) return Rule_set;

Fig. 3.10 Basic sequential covering algorithm

The process continues until the terminating condition is met, such as when there are no more training tuples or the quality of a rule returned is below a user-specified threshold. The Learn One Rule procedure finds the “best” rule for the current class, given the current set of training tuples.

Typically, rules are grown in a general-to-specific manner. We can think of this as a beam search, where we start off with an empty rule and then gradually keep appending attribute tests to it. We append by adding the attribute test as a logical conjunct to the existing condition of the rule antecedent. Suppose our training set, D, consists of loan application data. Attributes regarding each applicant include their age, income, education level, residence, credit rating, and the term of the loan. The classifying attribute is loan decision, which indicates whether a loan is accepted (considered safe) or rejected (considered risky). To learn a rule for the class “accept,” we start off with the most general rule possible, that is, the condition of the rule antecedent is empty. The rule is:

IF THEN loan_decision = accept.


[Figure: a general-to-specific search through rule space. The search starts from the empty rule “IF THEN loan_decision = accept” and successively appends attribute tests, producing candidate rules such as “IF income = high THEN loan_decision = accept” and “IF loan_term = short THEN loan_decision = accept”, and, at the next level, rules such as “IF income = high AND credit_rating = excellent THEN loan_decision = accept”.]

Fig. 3.11 A general-to-specific search through rule space

We then consider each possible attribute test that may be added to the rule. These can be derived from the parameter Att_vals, which contains a list of attributes with their associated values. For example, for an attribute-value pair (att, val), we can consider attribute tests such as att = val, att ≤ val, att > val, and so on. Typically, the training data will contain many attributes, each of which may have several possible values. Finding an optimal rule set becomes computationally explosive.

Instead, Learn One Rule adopts a greedy depth-first strategy. Each time it is faced with adding a new attribute test (conjunct) to the current rule, it picks the one that most improves the rule quality, based on the training samples. We will say more about rule quality measures in a minute. For the moment, let’s say we use rule accuracy as our quality measure. Getting back to our example with the above figure, suppose Learn One Rule finds that the attribute test income = high best improves the accuracy of our current (empty) rule. We append it to the condition, so that the current rule becomes

IF income = high THEN loan decision = accept.

Each time we add an attribute test to a rule, the resulting rule should cover more of the “accept” tuples. During the next iteration, we again consider the possible attribute tests and end up selecting credit rating = excellent. Our current rule grows to become

IF income = high AND credit rating = excellent THEN loan decision = accept.

The process repeats, where at each step, we continue to greedily grow rules until the resulting rule meets an acceptable quality level. Greedy search does not allow for backtracking. At each step, we heuristically add what appears to be the best choice at the moment. What if we unknowingly made a poor choice along the way? In order to lessen the chance of this happening, instead of selecting the best attribute test to append to the current rule, we can select the best k attribute tests. In this way, we perform a beam search of width k wherein we maintain the k best candidates overall at each step, rather than a single best candidate.
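The following is a simplified Python sketch of this greedy, general-to-specific rule growth in the spirit of Learn_One_Rule; it is not the textbook procedure itself: beam search, rule-quality measures other than accuracy, and stopping thresholds are all omitted, and the loan data are invented.

```python
# Greedy rule growth: repeatedly append the attribute test that most
# improves rule accuracy on the training tuples.
def accuracy(rule, data, target_class):
    covered = [t for t in data if all(t[a] == v for a, v in rule)]
    if not covered:
        return 0.0
    return sum(t["class"] == target_class for t in covered) / len(covered)

def learn_one_rule(data, att_vals, target_class):
    rule = []                                  # empty antecedent: IF THEN class
    best = accuracy(rule, data, target_class)
    improved = True
    while improved:
        improved = False
        for att, vals in att_vals.items():
            if any(a == att for a, _ in rule):
                continue                       # at most one test per attribute
            for v in vals:
                cand = rule + [(att, v)]
                score = accuracy(cand, data, target_class)
                if score > best:
                    best, best_cand, improved = score, cand, True
        if improved:
            rule = best_cand
    return rule

data = [
    {"income": "high", "credit": "excellent", "class": "accept"},
    {"income": "high", "credit": "fair",      "class": "accept"},
    {"income": "low",  "credit": "excellent", "class": "reject"},
    {"income": "low",  "credit": "fair",      "class": "reject"},
]
att_vals = {"income": ["high", "low"], "credit": ["excellent", "fair"]}
print(learn_one_rule(data, att_vals, "accept"))   # e.g. [("income", "high")]
```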

Classification by backpropagation
Backpropagation is a neural network learning algorithm. The field of neural networks was originally kindled by psychologists and neurobiologists who sought to develop and test computational analogues of neurons. Roughly speaking, a neural network is a set of connected input/output units in which each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples. Neural network learning is also referred to as connectionist learning due to the connections between units. Neural networks involve long training times and are therefore more suitable for applications where this is feasible. They require a number of parameters that are typically best determined empirically, such as the network topology or “structure.” Neural networks have been criticised for their poor interpretability. For example, it is difficult for humans to interpret the symbolic meaning behind the learned weights and of “hidden units” in the network. These features initially made neural networks less desirable for data mining.

A multilayer feed-forward neural network
The backpropagation algorithm performs learning on a multilayer feed-forward neural network. It iteratively learns a set of weights for prediction of the class label of tuples. A multilayer feed-forward neural network consists of an input layer, one or more hidden layers, and an output layer. An example of a multilayer feed-forward network is shown in the following figure.

Each layer is made up of units. The inputs to the network correspond to the attributes measured for each training tuple. The inputs are fed simultaneously into the units making up the input layer. These inputs pass through the input layer and are then weighted and fed simultaneously to a second layer of “neuronlike” units, known as a hidden layer. The outputs of the hidden layer units can be input to another hidden layer, and so on. The number of hidden layers is arbitrary, although in practice, usually only one is used. The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network’s prediction for given tuples. The units in the input layer are called input units. The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units. The multilayer neural network shown in the figure has two layers of output units.

[Figure: a multilayer feed-forward neural network with input units x1, x2, x3, x4, a hidden layer producing outputs Oj through weights wij, and an output layer producing outputs Ok through weights wjk.]

Fig. 3.12 A multilayer feed-forward neural network

Therefore, we say that it is a two-layer neural network. (The input layer is not counted because it serves only to pass the input values to the next layer.) Similarly, a network containing two hidden layers is called a three-layer neural network, and so on. The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer. It is fully connected in that each unit provides input to each unit in the next forward layer.

Defining a network topology
Before beginning the training, the user must decide on the network topology by specifying the number of units in the input layer, the number of hidden layers (if more than one), the number of units in each hidden layer, and the number of units in the output layer.


Normalising the input values for each attribute measured in the training tuples will help speed up the learning phase. Typically, input values are normalised so as to fall between 0.0 and 1.0. Discrete-valued attributes may be encoded such that there is one input unit per domain value. For example, if an attribute A has three possible or known values, namely {a0, a1, a2}, then we may assign three input units to represent A. That is, we may have, say, I0, I1, I2 as input units. Each unit is initialised to 0. If A = a0, then I0 is set to 1. If A = a1, I1 is set to 1, and so on.

Neural networks can be used for either classification (to predict the class label of a given tuple) or prediction (to predict a continuous-valued output). For classification, one output unit may be used to represent two classes (where the value 1 represents one class and the value 0 represents the other). If there are more than two classes, then one output unit per class is used. There are no clear rules as to the “best” number of hidden layer units. Network design is a trial-and-error process and may affect the accuracy of the resulting trained network.
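The one-input-unit-per-value encoding described above amounts to what is commonly called one-hot encoding; a minimal sketch follows, with the domain {a0, a1, a2} taken from the example.

```python
# One input unit per domain value: a one-hot encoding for a discrete attribute.
def one_hot(value, domain):
    # Returns a list of 0/1 inputs, one per domain value.
    return [1 if value == d else 0 for d in domain]

domain_A = ["a0", "a1", "a2"]
print(one_hot("a1", domain_A))   # [0, 1, 0]: I0 = 0, I1 = 1, I2 = 0
```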

The initial values of the weights may also affect the resulting accuracy. Once a network has been trained and its accuracy is not considered acceptable, it is common to repeat the training process with a different network topology or a different set of initial weights. Cross-validation techniques for accuracy estimation can be used to help to decide when an acceptable network has been found. A number of automated techniques have been proposed that search for a “good” network structure. These typically use a hill-climbing approach that starts with an initial structure that is selectively modified.

Backpropagation
Backpropagation learns by iteratively processing a data set of training tuples, comparing the network’s prediction for each tuple with the actual known target value. The target value may be the known class label of the training tuple (for classification problems) or a continuous value (for prediction). For each training tuple, the weights are modified so as to minimise the mean squared error between the network’s prediction and the actual target value. These modifications are made in the “backwards” direction, that is, from the output layer, through each hidden layer, down to the first hidden layer (hence the name backpropagation). Although it is not guaranteed, in general the weights will eventually converge, and the learning process stops. The algorithm is summarised in the following figure. The steps involved are expressed in terms of inputs, outputs, and errors, and may seem awkward if this is your first look at neural network learning. However, once you become familiar with the process, you will see that each step is inherently simple. The steps are described below.
• Algorithm: Backpropagation. Neural network learning for classification or prediction using the backpropagation algorithm.
• Input: D, a data set consisting of the training tuples and their associated target values; l, the learning rate; network, a multilayer feed-forward network.
• Output: A trained neural network.
• Method:

1. Initialise all weights and biases in network;
2. while terminating condition is not satisfied {
3.     for each training tuple X in D {
4.         // Propagate the inputs forward:
5.         for each input layer unit j {
6.             Oj = Ij; } // output of an input unit is its actual input value
7.         for each hidden or output layer unit j {
8.             Ij = Σi wij Oi + θj; // compute the net input of unit j with respect to the previous layer, i
9.             Oj = 1 / (1 + e^(-Ij)); } // compute the output of each unit j
10.        // Backpropagate the errors:
11.        for each unit j in the output layer
12.            Errj = Oj (1 - Oj)(Tj - Oj); // compute the error
13.        for each unit j in the hidden layers, from the last to the first hidden layer
14.            Errj = Oj (1 - Oj) Σk Errk wjk; // compute the error with respect to the next higher layer, k
15.        for each weight wij in network {
16.            Δwij = (l) Errj Oi; // weight increment
17.            wij = wij + Δwij; } // weight update
18.        for each bias θj in network {
19.            Δθj = (l) Errj; // bias increment
20.            θj = θj + Δθj; } // bias update
21. } }
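For readers who prefer runnable code to pseudocode, the following numpy sketch applies the same forward and backward update rules to a network with one hidden layer of sigmoid units; the learning rate, layer sizes, epoch count, and toy XOR-style data are arbitrary illustrative choices, and convergence is not guaranteed for every random initialisation.

```python
# A minimal numpy sketch of the weight and bias updates above for a
# two-layer (one hidden layer) feed-forward network with sigmoid units.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)          # target values

n_in, n_hidden, n_out, l = 2, 3, 1, 0.5                  # l is the learning rate
W1 = rng.normal(0, 0.5, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.5, (n_hidden, n_out)); b2 = np.zeros(n_out)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(5000):
    for x, t in zip(X, T):
        # Propagate the inputs forward
        o_hidden = sigmoid(x @ W1 + b1)                   # Oj for hidden units
        o_out = sigmoid(o_hidden @ W2 + b2)               # Oj for output units
        # Backpropagate the errors
        err_out = o_out * (1 - o_out) * (t - o_out)       # Errj = Oj(1-Oj)(Tj-Oj)
        err_hidden = o_hidden * (1 - o_hidden) * (W2 @ err_out)  # Oj(1-Oj) * sum_k Errk wjk
        # Update weights and biases in the backwards direction
        W2 += l * np.outer(o_hidden, err_out); b2 += l * err_out
        W1 += l * np.outer(x, err_hidden);     b1 += l * err_hidden

print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```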

3.8.2 Clustering
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labelling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (for example, using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.

Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups in their customer bases and characterise customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, categorise genes with similar functionality, and gain insight into structures inherent in populations. Clustering may also help in the identification of areas of similar land use in an earth observation database and in the identification of groups of houses in a city according to house type, value, and geographic location, as well as the identification of groups of automobile insurance policy holders with a high average claim cost. It can also be used to help classify documents on the Web for information discovery. Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that are “far away” from any cluster) may be more interesting than common cases.

Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce. For example, exceptional cases in credit card transactions, such as very expensive and frequent purchases, may be of interest as possible fraudulent activity. As a data mining function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as characterisation, attribute subset selection, and classification, which would then operate on the detected clusters and the selected attributes or features.

Data clustering is under vigorous development. Contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, and marketing. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research. As a branch of statistics, cluster analysis has been extensively studied for many years, focusing mainly on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning, clustering is an example of unsupervised learning. Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labelled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples. In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active themes of research focus on the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and methods for clustering mixed numerical and categorical data in large databases.

Clustering is a challenging field of research whose potential applications pose their own special requirements. The following are typical requirements of clustering in data mining:

• Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.
• Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.
• Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
• Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but it also makes the quality of clustering difficult to control.
• Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
• Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data (that is, database updates) into existing clustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data; that is, given a set of data objects, such an algorithm may return dramatically different clusterings depending on the order of presentation of the input objects. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.
• High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.
• Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city’s rivers and highway networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behaviour that satisfy specified constraints.
• Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.

Table 3.2 Requirements of clustering in data mining


Types of data in cluster analysis
Suppose that a data set to be clustered contains n objects, which may represent persons, houses, documents, countries, and so on. Main memory-based clustering algorithms typically operate on either of the following two data structures.

Data matrix (or object-by-variable structure): This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, weight, gender, and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects × p variables):

    [ x11 ... x1f ... x1p ]
    [ ...     ...     ... ]
    [ xi1 ... xif ... xip ]
    [ ...     ...     ... ]
    [ xn1 ... xnf ... xnp ]

Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities that are accessible for all pairs of n objects. It is often represented by an n-by-n table:

    [ 0                               ]
    [ d(2,1)  0                       ]
    [ d(3,1)  d(3,2)  0               ]
    [ ...     ...     ...             ]
    [ d(n,1)  d(n,2)  ...   ...   0   ]

where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a nonnegative number that is close to 0 when objects i and j are highly similar or “near” each other, and becomes larger the more they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, only the lower triangle of the matrix needs to be stored, as shown above.

The rows and columns of the data matrix represent different entities, while those of the dissimilarity matrix represent the same entity. Thus, the data matrix is often called a two-mode matrix, whereas the dissimilarity matrix is called a one-mode matrix. Many clustering algorithms operate on a dissimilarity matrix. If the data are presented in the form of a data matrix, it can first be transformed into a dissimilarity matrix before applying such clustering algorithms.
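A brief sketch of this transformation, computing a Euclidean dissimilarity matrix from a small data matrix with scipy (assuming scipy is available), is shown below; in practice the attributes would usually be standardised first so that no single variable dominates the distances.

```python
# Converting a data matrix (n objects x p variables) into a dissimilarity
# matrix using Euclidean distance.
import numpy as np
from scipy.spatial.distance import pdist, squareform

data_matrix = np.array([
    [35, 40000.0],
    [40, 42000.0],
    [22, 18000.0],
])
dissimilarity_matrix = squareform(pdist(data_matrix, metric="euclidean"))
print(np.round(dissimilarity_matrix, 1))   # d(i, j) = d(j, i), d(i, i) = 0
```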

Categorisation of major clustering methods
Many clustering algorithms exist in the literature. It is difficult to provide a crisp categorisation of clustering methods as these categories may overlap, so that a method may have features from several categories. Nevertheless, it is useful to present a relatively organised picture of the different clustering methods. In general, the major clustering methods can be classified into the following categories.

Partitioning methods
To achieve global optimality in partitioning-based clustering, we would require the exhaustive enumeration of all of the possible partitions. Instead, heuristic clustering methods are used; these work well for finding spherical-shaped clusters in small to medium-sized databases. To find clusters with complex shapes and for clustering very large data sets, partitioning-based methods need to be extended. The most well-known and commonly used partitioning methods are k-means, k-medoids, and their variations.

k-means algorithm
The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of which initially represents a cluster mean or centre. For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the square-error criterion is used, defined as

E = Σi=1..k Σp∈Ci |p - mi|²,

where E is the sum of the square error for all objects in the data set; p is the point in space representing a given object; and mi is the mean of cluster Ci (both p and mi are multidimensional). In other words, for each object in each cluster, the distance from the object to its cluster centre is squared, and the distances are summed. This criterion tries to make the resulting k clusters as compact and as separate as possible.

k-medoids algorithm
Each remaining object is clustered with the representative object to which it is the most similar. The partitioning method is then performed based on the principle of minimising the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as

E = Σj=1..k Σp∈Cj |p - oj|,

where E is the sum of the absolute error for all objects in the data set; p is the point in space representing a given object in cluster Cj; and oj is the representative object of Cj. In general, the algorithm iterates until, eventually, each representative object is actually the medoid, or most centrally located object, of its cluster. This is the basis of the k-medoids method for grouping n objects into k clusters.

The k-medoids clustering process is discussed below.
• The initial representative objects (or seeds) are chosen arbitrarily.
• The iterative process of replacing representative objects by non-representative objects continues as long as the quality of the resulting clustering is improved.
• This quality is estimated using a cost function that measures the average dissimilarity between an object and the representative object of its cluster.

• To determine whether a non-representative object, o_random, is a good replacement for a current representative object, oj, the following four cases are examined for each of the non-representative objects, p.

Case 1: p currently belongs to representative object oj. If oj is replaced by o_random as a representative object and p is closest to one of the other representative objects, oi, i ≠ j, then p is reassigned to oi.

Case 2: p currently belongs to representative object oj. If oj is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.

Case 3: p currently belongs to representative object oi, i ≠ j. If oj is replaced by o_random as a representative object and p is still closest to oi, then the assignment does not change.

Case 4: p currently belongs to representative object oi, i ≠ j. If oj is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.
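Before turning to hierarchical methods, here is a compact numpy sketch of the k-means iteration described earlier: assign each object to its nearest cluster mean, then recompute the means until they stop changing. The value of k, the iteration cap, and the sample points are illustrative, and empty-cluster handling is omitted.

```python
# A compact numpy sketch of the k-means loop: assignment step followed by
# a mean-update step, iterated until the means converge.
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    means = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # distance of every point to every cluster mean
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)              # assign to the nearest mean
        # recompute each cluster mean (empty clusters are not handled here)
        new_means = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_means, means):
            break                                  # criterion function converged
        means = new_means
    return labels, means

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [4.8, 5.3]])
labels, means = kmeans(points, k=2)
print(labels, means)
```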

Hierarchical methods
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.

In general, there are two types of hierarchical clustering methods:
• Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering methods belong to this category. They differ only in their definition of intercluster similarity.
• Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as a desired number of clusters being obtained or the diameter of each cluster being within a certain threshold.

Density-based methods
Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty in discovering clusters of arbitrary shapes. Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the “neighbourhood” exceeds some threshold; that is, for each data point within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.

Density-based spatial clustering of applications with noise
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. The algorithm grows regions with sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points.

The basic ideas of density-based clustering involve a number of new definitions.
• The neighbourhood within a radius ε of a given object is called the ε-neighbourhood of the object.
• If the ε-neighbourhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.
• Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is within the ε-neighbourhood of q, and q is a core object.
• An object p is density-reachable from object q with respect to ε and MinPts in a set of objects, D, if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such that pi+1 is directly density-reachable from pi with respect to ε and MinPts, for 1 ≤ i ≤ n, pi ∈ D.
• An object p is density-connected to object q with respect to ε and MinPts in a set of objects, D, if there is an object o ∈ D such that both p and q are density-reachable from o with respect to ε and MinPts.
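As a usage sketch (assuming the scikit-learn library is available), DBSCAN can be run with eps playing the role of the neighbourhood radius ε and min_samples playing the role of MinPts; the data and parameter values below are illustrative.

```python
# A short usage sketch of DBSCAN: dense groups become clusters and
# isolated points are labelled as noise.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [5.0, 5.0], [5.1, 5.1], [9.0, 0.0]])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # points labelled -1 are treated as noise (outliers)
```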

OPTICS: Ordering Points to Identify the Clustering Structure
A cluster analysis method called OPTICS was proposed. Rather than produce a data set clustering explicitly, OPTICS computes an augmented cluster ordering for automatic and interactive cluster analysis. This ordering represents the density-based clustering structure of the data. It contains information that is equivalent to density-based clustering obtained from a wide range of parameter settings. The cluster ordering can be used to extract basic clustering information (such as cluster centres or arbitrary-shaped clusters) as well as provide the intrinsic clustering structure.

Grid-based methods
Grid-based methods quantise the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure (that is, on the quantised space). The main benefit of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension of the quantised space. STING is a typical example of a grid-based method.

STING: Statistical Information Grid
STING is a grid-based multiresolution clustering technique in which the spatial area is divided into rectangular cells. There are usually several levels of such rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of cells at the next lower level. Statistical information regarding the attributes in each grid cell (such as the mean, maximum, and minimum values) is precomputed and stored.

[Figure: the STING grid hierarchy, in which each cell at the 1st (top) layer is recursively partitioned into cells at the (i-1)-st layer and then the i-th layer.]

Fig. 3.13 A hierarchical structure for STING clustering

WaveCluster: Clustering using Wavelet Transformation
The advantages of wavelet transformation for clustering are explained below.
• It provides unsupervised clustering. It uses hat-shaped filters that emphasise regions where the points cluster, while suppressing weaker information outside of the cluster boundaries. Thus, dense regions in the original feature space act as attractors for nearby points and as inhibitors for points that are further away. This means that the clusters in the data automatically stand out and “clear” the regions around them. As a result, wavelet transformation can also automatically remove outliers.
• The multiresolution property of wavelet transformations can help detect clusters at varying levels of accuracy.
• Wavelet-based clustering is very fast, with a computational complexity of O(n), where n is the number of objects in the database. The algorithm implementation can be made parallel.


Model-based methods
Model-based methods hypothesise a model for each of the clusters and find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. It also leads to a way of automatically determining the number of clusters based on standard statistics, taking “noise” or outliers into account and thus yielding robust clustering methods. EM is an algorithm that performs expectation-maximisation analysis based on statistical modelling. COBWEB is a conceptual learning algorithm that performs probability analysis and takes concepts as a model for clusters.

Expectation-Maximisation
The EM (Expectation-Maximisation) algorithm is a popular iterative refinement algorithm that can be used for finding the parameter estimates. It can be viewed as an extension of the k-means paradigm, which assigns an object to the cluster with which it is most similar, based on the cluster mean. The algorithm is described as follows:
• Make an initial guess of the parameter vector: this involves randomly selecting k objects to represent the cluster means or centres (as in k-means partitioning), as well as making guesses for the additional parameters.


Fig. 3.14 EM algorithm

Each cluster can be represented by a probability distribution, centred at a mean and with a standard deviation. Here, we have two clusters, corresponding to the Gaussian distributions g(m1, σ1) and g(m2, σ2), respectively, where the dashed circles represent the first standard deviation of the distributions.

• Iteratively refine the parameters (or clusters) based on the following two steps:

(a) Expectation step: Assign each object xi to cluster Ck with the probability

P(xi ∈ Ck) = p(Ck | xi) = p(Ck) p(xi | Ck) / p(xi),

where p(xi | Ck) = N(mk, Ek)(xi) follows the normal (that is, Gaussian) distribution around mean mk with expectation Ek. In other words, this step calculates the probability of cluster membership of object xi for each of the clusters. These probabilities are the “expected” cluster memberships for object xi.

(b) Maximisation step: Use the probability estimates from above to re-estimate (or refine) the model parameters. This step is the “maximisation” of the likelihood of the distributions given the data.

3.8.3 Association Rules
The goal of the techniques described in this topic is to detect relationships or associations between specific values of categorical variables in large data sets. This is a common task in many data mining projects, as well as in the data mining subcategory text mining. These powerful exploratory techniques have a wide range of applications in many areas of business practice and research.

Working of association rules
The usefulness of this technique in addressing unique data mining problems is best illustrated by a simple example. Suppose we are collecting data at the check-out cash registers of a large book store. Each customer transaction is logged in a database and consists of the titles of the books purchased by the respective customer, perhaps additional magazine titles and other gift items that were purchased, and so on. Hence, each record in the database will represent one customer (transaction), and may consist of a single book purchased by that customer, or it may consist of many (perhaps hundreds of) different items that were purchased, arranged in an arbitrary order depending on the order in which the different items (books, magazines, and so on) came down the conveyor belt at the cash register. The purpose of the analysis is to find associations between the items that were purchased, that is, to derive association rules, which identify the items and co-occurrences of various items that appear with the greatest (co-)frequencies.

The rules of association are discussed below:
• Sequence analysis: Sequence analysis is concerned with a subsequent purchase of a product or products given a previous purchase. For instance, buying an extended warranty is more likely to follow (in that specific sequential order) the purchase of a TV or other electric appliance. Sequence rules, however, are not always that obvious, and sequence analysis helps you to extract such rules no matter how hidden they may be in your market basket data. There is a wide range of applications for sequence analysis in many areas of industry, including customer shopping patterns, phone call patterns, the fluctuation of the stock market, DNA sequences, and Web log streams.
• Link analysis: Once extracted, rules about associations or the sequences of items as they occur in a transaction database can be extremely useful for numerous applications. Obviously, in retailing or marketing, knowledge of purchase “patterns” can help with the direct marketing of special offers to the “right” or “ready” customers (that is, those who, according to the rules, are most likely to purchase specific items given their observed past consumption patterns). However, transaction databases occur in many areas of business, such as banking. In fact, the term “link analysis” is often used when these techniques, for extracting sequential or non-sequential association rules, are applied to organise complex “evidence.” It is easy to see how the “transactions” or “shopping basket” metaphor can be applied to situations where individuals engage in certain actions, open accounts, contact other specific individuals, and so on. Applying the technologies described here to such databases may quickly extract patterns and associations between individuals and actions and, hence, for example, reveal the patterns and structure of some clandestine illegal network.
• Unique data analysis requirements: Cross-tabulation tables, and in particular Multiple Response tables, can be used to analyse data of this kind. However, in cases where the number of different items (categories) in the data is very large (and not known ahead of time), and where the “factorial degree” of important association rules is not known ahead of time, these tabulation facilities may be too cumbersome to use, or simply not applicable. Consider once more the simple bookstore example discussed earlier. First, the number of book titles is practically unlimited. In other words, if we made a table where each book title represented one dimension, and the purchase of that book (yes/no) were the classes or categories for each dimension, then the complete cross-tabulation table would be huge and sparse (consisting mostly of empty cells). Alternatively, we could construct all possible two-way tables from all items available in the store; this would allow us to detect two-way associations (association rules) between items. However, the number of tables that would have to be constructed would again be huge, most of the two-way tables would be sparse, and, worse, if there were any three-way association rules “hiding” in the data, we would miss them completely. The a priori (Apriori) algorithm implemented in Association Rules will not only automatically detect the relationships (“cross-tabulation tables”) that are important (that is, cross-tabulation tables that are not sparse, not containing mostly zeros), but also determine the factorial degree of the tables that contain the important association rules.


Tabular representation of associations Association rules are generated in the general form if Body then Head, where Body and Head stand for single codes or text values (items) or conjunctions of codes or text values (items); for example, if (Car=Porsche and Age<20) then (Risk=High and Insurance=High). The major statistics computed for the association rules are Support (relative frequency of the Body or Head of the rule), Confidence (conditional probability of the Head given the Body of the rule), and Correlation (support for Body and Head, divided by the square root of the product of the support for the Body and the support for the Head). These statistics can be summarised in a results spreadsheet, as shown below.

Fig. 3.15 Tabular representation of association (Source: http://www.statsoft.com/textbook/association-rules/)

This results spreadsheet shows an example of how association rules can be applied to text mining tasks. This analysis was performed on the paragraphs (dialog spoken by the characters in the play) in the first scene of Shakespeare’s “All’s Well That Ends Well,” after removing a few very frequent words like is, of, and so on. The values for support, confidence, and correlation are expressed in percent.
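Written out for a single rule, the three statistics reduce to a few ratios. The 0/1 record lists below are hypothetical; the formulas simply follow the definitions given above.

import math

def rule_statistics(body_holds, head_holds):
    # body_holds / head_holds: one 0/1 entry per record, indicating whether
    # the Body / Head of the rule holds for that record.
    n = len(body_holds)
    s_body = sum(body_holds) / n                          # support of the Body
    s_head = sum(head_holds) / n                          # support of the Head
    s_rule = sum(b and h for b, h in zip(body_holds, head_holds)) / n
    confidence = s_rule / s_body                          # P(Head | Body)
    correlation = s_rule / math.sqrt(s_body * s_head)
    return s_rule, confidence, correlation

# Hypothetical example: 10 records, Body holds in 4, Head in 5, both in 3.
body = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
head = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
s, c, r = rule_statistics(body, head)
print("support=%.2f confidence=%.2f correlation=%.2f" % (s, c, r))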

Graphical representation of association As a result of applying Association Rules data mining techniques to large datasets, rules of the form if "Body" then "Head" will be derived, where Body and Head stand for simple codes or text values (items), or the conjunction of codes and text values (items); for example, if (Car=Porsche and Age<20) then (Risk=High and Insurance=High). These rules can be reviewed in textual format or tables, or in graphical format.

Association Rules Networks, 3D: Association rules can be graphically summarised in 2D Association Networks, as well as 3D Association Networks. Shown below are some (very clear) results from an analysis. Respondents in a survey were asked to list their (up to) 3 favourite fast-foods. The association rules derived from those data are summarised in a 3D Association Network display.

Fig. 3.16 Association Rules Networks, 3D (Source: http://www.statsoft.com/textbook/association-rules/)


Summary
• Data mining refers to the process of finding interesting patterns in data that are not explicitly part of the data.
• Data mining techniques can be applied to a wide range of data repositories including databases, data warehouses, spatial data, multimedia data, Internet or Web-based data and complex objects.
• The Knowledge Discovery in Databases process consists of a few steps leading from raw data collections to some form of new knowledge.
• Data selection and data transformation can also be combined where the consolidation of the data is the result of the selection, or, as for the case of data warehouses, the selection is done on transformed data.
• Data mining derives its name from the similarities between searching for valuable information in a large database and mining rocks for a vein of valuable ore.
• The concept of bagging applies to the area of predictive data mining, to combine the predicted classifications from multiple models, or from the same type of model for different learning data.
• The Cross-Industry Standard Process for Data Mining (CRISP–DM) was developed in 1996 by analysts representing DaimlerChrysler, SPSS, and NCR. CRISP provides a non-proprietary and freely available standard process for fitting data mining into the general problem-solving strategy of a business or research unit.
• With the increasing demand on the analysis of large amounts of structured data, graph mining has become an active and important theme in data mining.
• The graph is typically very large, with nodes corresponding to objects and edges corresponding to links representing relationships or interactions between objects. Both nodes and links have attributes.
• Social networks are rarely static. Their graph representations evolve as nodes and edges are added or deleted over time.
• Multirelational data mining (MRDM) methods search for patterns that involve multiple tables (relations) from a relational database.
• The data mining algorithm is the mechanism that creates mining models.
• Classification is the task of generalising known structure to apply to new data.
• A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression.
• The goal of the techniques described in this topic is to detect relationships or associations between specific values of categorical variables in large data sets. This is a common task in many data mining projects as well as in the data mining subcategory text mining.

References
• Seifert, J. W., 2004. Data Mining: An Overview [Online PDF] Available at: . [Accessed 9 September 2011].
• Alexander, D., Data Mining [Online] Available at: . [Accessed 9 September 2011].
• Han, J., Kamber, M. and Pei, J., 2011. Data Mining: Concepts and Techniques, 3rd ed., Elsevier.
• Adriaans, P., 1996. Data Mining, Pearson Education India.
• StatSoft, 2010. Data Mining, Cluster Techniques - Session 28 [Video Online] Available at: <http://www.youtube.com/watch?v=WvR_0Vs1U8w>. [Accessed 12 September 2011].
• Swallacebithead, 2010. Using Data Mining Techniques to Improve Forecasting [Video Online] Available at: . [Accessed 12 September 2011].

Recommended Reading
• Chattamvelli, R., 2011. Data Mining Algorithms, Alpha Science International Ltd.
• Thuraisingham, B. M., 1999. Data mining: technologies, techniques, tools, and trends, CRC Press.
• Witten, I. H. and Frank, E., 2005. Data mining: practical machine learning tools and techniques, 2nd ed., Morgan Kaufmann.


Self Assessment 1. Which of the following refers to the process of finding interesting patterns in data that are not explicitly part of the data? a. Data mining b. Data warehousing c. Data extraction d. Metadata

2. What is the full form of KDD? a. Knowledge Data Defining b. Knowledge Defining Database c. Knowledge Discovery in Database d. Knowledge Database Discovery

3. Which of the following is used to address the inherent instability of results while applying complex models to relatively small data sets? a. Boosting b. Bagging c. Data reduction d. Data preparation

4. The concept of ______in predictive data mining refers to the application of a model for prediction or classification to new data. a. bagging b. boosting c. drill-down analysis d. deployment

5. What is referred to as Stacking (Stacked Generalisation)? a. Meta-learning b. Metadata c. Deployment d. Boosting

6. Which statement is false? a. The concept of meta-learning applies to the area of predictive data mining, to combine the predictions from multiple models. b. In the business environment, complex data mining projects may require the coordinate efforts of various experts, stakeholders, or departments throughout an entire organisation. c. The concept of drill-down analysis applies to the area of data mining, to denote the interactive exploration of data, in particular of large databases d. Text mining is usually applied to identify data mining projects with the goal to identify a statistical or neural network model or set of models that can be used to predict some response of interest.

7. The Cross-Industry Standard Process for Data Mining (CRISP–DM) was developed in ______. a. 1995 b. 1996 c. 1997 d. 1998

8. Match the columns
Column A:
A. It is a heterogeneous and multirelational data set represented by a graph.
B. This phenomenon is represented in the preferential attachment model.
C. This contradicts an earlier belief that the diameter slowly increases as a function of network size.
D. This was known as the constant average degree assumption.
Column B:
1. Densification power law
2. Shrinking diameter
3. Heavy-tailed out-degree and in-degree distributions
4. Social Network Analysis
a. 1-A, 2-B, 3-C, 4-D b. 1-D, 2-C, 3-B, 4-A c. 1-B, 2-A, 3-D, 4-C d. 1-C, 2-D, 3-A, 4-B

9. ______algorithm is a popular iterative refinement algorithm that can be used for finding the parameter estimates. a. Model-based methods b. Wavecluster c. Expectation-Maximisation d. Grid-based method

10. ______is a grid-based multiresolution clustering technique in which the spatial area is divided into rectangular cells. a. STING b. KDD c. DBSCAN d. OPTICS


Chapter IV Web Application of Data Mining

Aim

The aim of this chapter is to:

• introduce the concept of knowledge discovery in databases

• analyse different goals of data mining and knowledge discovery

• explore the knowledge discovery process

Objectives

The objectives of this chapter are to:

• highlight web mining

• describe what graph mining is

• elucidate web content mining

Learning outcome

At the end of this chapter, you will be able to:

• enlist types of knowledge discovered during data mining

• comprehend benefits of web mining

• understand web and web usage mining

4.1 Introduction Knowledge Discovery in Databases, frequently abbreviated as KDD, typically encompasses more than data mining. The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.

As an example, consider a transaction database maintained by a specialty consumer goods retailer. Suppose the client data includes a customer name, zip code, phone number, date of purchase, item code, price, quantity, and total amount. A variety of new knowledge can be discovered by KDD processing on this client database. During data selection, data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected. The data cleansing process may correct invalid zip codes or eliminate records with incorrect phone prefixes. Enrichment typically enhances the data with additional sources of information. For example, given the client names and phone numbers, the store may purchase other data about age, income, and credit rating and append them to each record.

Data transformation and encoding may be done to reduce the amount of data. For instance, item codes may be grouped in terms of product categories into audio, video, supplies, electronic gadgets, camera, accessories, and so on. Zip codes may be aggregated into geographic regions; incomes may be divided into ranges, and so on. If data mining is based on an existing warehouse for this retail store chain, we would expect that the cleaning has already been applied. It is only after such preprocessing that data mining techniques are used to mine different rules and patterns.
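Such grouping and binning steps can be expressed directly in code. The sketch below uses pandas; the column names, category mapping and income ranges are hypothetical illustrations of the kind of encoding described above, not values taken from the text.

import pandas as pd

# Hypothetical cleaned client transactions after selection and enrichment.
sales = pd.DataFrame({
    "item_code": ["TV-100", "CAM-20", "AUD-77", "TV-150"],
    "zip_code":  ["10001", "60601", "94105", "10002"],
    "income":    [35000, 82000, 150000, 58000],
})

# Group item codes into broader product categories.
prefix_to_category = {"TV": "video", "CAM": "camera", "AUD": "audio"}
sales["category"] = sales["item_code"].str.split("-").str[0].map(prefix_to_category)

# Aggregate zip codes into coarse geographic regions (here simply by first digit).
sales["region"] = "region_" + sales["zip_code"].str[0]

# Divide incomes into ranges (binning) to reduce the number of distinct values.
sales["income_band"] = pd.cut(sales["income"],
                              bins=[0, 40000, 90000, float("inf")],
                              labels=["low", "middle", "high"])
print(sales)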

We can see that many possibilities exist for discovering new knowledge about buying patterns, relating factors such as age, income group, place of residence, to what and how much the customers purchase. This information can then be utilised to plan additional store locations based on demographics, to run store promotions, to combine items in advertisements, or to plan seasonal marketing strategies. As this retail store example shows, data mining must be preceded by significant data preparation before it can yield useful information that can directly influence business decisions. The results of data mining may be reported in a variety of formats, such as listings, graphic outputs, summary tables, or visualisations.

Fig. 4.1 Knowledge base (figure components: business case definition, preparation, data mining, evaluation, knowledge base)


4.2 Goals of Data Mining and Knowledge Discovery Data mining is typically carried out with some end goals or applications. Broadly speaking, these goals fall into the following classes: prediction, identification, classification, and optimisation.
• Prediction: Data mining can show how certain attributes within the data will behave in the future. Examples of predictive data mining include the analysis of buying transactions to predict what consumers will buy under certain discounts, how much sales volume a store would generate in a given period, and whether deleting a product line would yield more profits. In such applications, business logic is coupled with data mining. In a scientific context, certain seismic wave patterns may predict an earthquake with high probability.
• Identification: Data patterns can be used to identify the existence of an item, an event, or an activity. For instance, intruders trying to break into a system may be identified by the programs executed, files accessed, and CPU time per session. In biological applications, the existence of a gene may be identified by certain sequences of nucleotide symbols in the DNA sequence. The area known as authentication is a form of identification. It confirms whether a user is indeed a specific user or one from an authorised class, and involves a comparison of parameters or images or signals against a database.
• Classification: Data mining can partition the data so that different classes or categories can be identified based on combinations of parameters. For example, customers in a supermarket can be categorised into discount-seeking shoppers, shoppers in a rush, loyal regular shoppers, shoppers attached to name brands, and infrequent shoppers. This classification may be used in different analyses of customer buying transactions as a post-mining activity. Sometimes, classification based on common domain knowledge is used as an input to decompose the mining problem and make it simpler. For instance, health foods, party foods, or school lunch foods are distinct categories in the supermarket business. It makes sense to analyse relationships within and across categories as separate problems. Such categorisation may be used to encode the data appropriately before subjecting it to further data mining.
• Optimisation: One eventual goal of data mining may be to optimise the use of limited resources such as time, space, money, or materials and to maximise output variables such as sales or profits under a given set of constraints. As such, this goal of data mining resembles the objective function used in operations research problems that deal with optimisation under constraints.

The term data mining is popularly used in a very broad sense. In some situations it includes statistical analysis and constrained optimisation as well as machine learning. There is no sharp line separating data mining from these disciplines. It is therefore beyond our scope to discuss in detail the entire range of applications that make up this vast body of work. For a detailed understanding of the area, readers are referred to specialised books devoted to data mining.

4.3 Types of Knowledge Discovered during Data Mining The term “knowledge” is very broadly interpreted as involving some degree of intelligence. There is a progression from raw data to information to knowledge as we go through additional processing.

Knowledge is often classified as inductive versus deductive. Deductive knowledge deduces new information based on applying pre-specified logical rules of deduction on the given data. Data mining addresses inductive knowledge, which discovers new rules and patterns from the supplied data. Knowledge can be represented in many forms: in an unstructured sense, it can be represented by rules or propositional logic; in a structured form, it may be represented in decision trees, semantic networks, neural networks, or hierarchies of classes or frames. It is common to describe the knowledge discovered during data mining in five ways, as follows:
• Association rules: These rules correlate the presence of a set of items with another range of values for another set of variables. Examples: (1) When a female retail shopper buys a handbag, she is likely to buy shoes. (2) An X-ray image containing characteristics a and b is likely to also exhibit characteristic c.

• Classification hierarchies: The goal is to work from an existing set of events or transactions to create a hierarchy of classes. Examples: (1) A population may be divided into five ranges of credit worthiness based on a history of previous credit transactions. (2) A model may be developed for the factors that determine the desirability of location of a store on a 1-10 scale. (3) Mutual funds may be classified based on performance data using characteristics such as growth, income, and stability.
• Sequential patterns: A sequence of actions or events is sought. Example: If a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within the next 18 months. Detection of sequential patterns is equivalent to detecting associations among events with certain temporal relationships.
• Patterns within time series: Similarities can be detected within positions of a time series of data, which is a sequence of data taken at regular intervals such as daily sales or daily closing stock prices. Examples: (1) Stocks of a utility company, ABC Power, and a financial company, XYZ Securities, showed the same pattern during 2002 in terms of closing stock price. (2) Two products show the same selling pattern in summer but a different pattern in winter. (3) A pattern in solar magnetic wind may be used to predict changes in earth atmospheric conditions.
• Clustering: A given population of events or items can be partitioned (segmented) into sets of "similar" elements. Examples: (1) An entire population of treatment data on a disease may be divided into groups based on the similarity of side effects produced. (2) The adult population in the United States may be categorised into five groups from "most likely to buy" to "least likely to buy" a new product. (3) The web accesses made by a collection of users against a set of documents (say, in a digital library) may be analysed in terms of the keywords of documents to reveal clusters or categories of users.

For most applications, the desired knowledge is a combination of the types discussed above.

4.4 Knowledge Discovery Process Before one attempts to extract useful knowledge from data, it is essential to understand the overall approach. Merely knowing many algorithms used for data analysis is not sufficient for a successful data mining (DM) project. The process defines a sequence of steps (with eventual feedback loops) that should be followed to discover knowledge (for example, patterns) in data. Each step is usually realised with the help of available commercial or open-source software tools. To formalise the knowledge discovery processes (KDPs) within a common framework, we introduce the concept of a process model. The model helps organisations to better understand the KDP and provides a roadmap to follow while planning and executing the project. This in turn results in cost and time savings, better understanding, and acceptance of the results of such projects. We need to understand that such processes are nontrivial and involve multiple steps, reviews of partial results, possibly several iterations, and interactions with the data owners. There are several reasons to structure a KDP as a standardised process model:
• The end product must be helpful for the user/owner of the data. A blind, unstructured application of DM techniques to input data, called data dredging, normally produces meaningless results/knowledge, that is, knowledge that, while interesting, does not contribute to solving the user's problem. This result ultimately leads to the failure of the project. Only through the application of well-defined KDP models will the end product be valid, novel, useful, and understandable.
• A well-defined KDP model should have a logical, cohesive, well-thought-out structure and approach that can be presented to decision-makers who may have difficulty understanding the need, value, and mechanics behind a KDP. Humans often fail to grasp the potential knowledge available in large amounts of untapped and possibly valuable data. They often do not want to devote significant time and resources to the pursuit of formal methods of knowledge extraction from the data, but rather prefer to rely heavily on the skills and experience of others (domain experts) as their source of information. However, because they are typically ultimately responsible for the decision(s) based on that information, they frequently want to understand (be comfortable with) the technology applied to those solutions. A process model that is well structured and logical will do much to alleviate any misgivings they may have.
• Knowledge discovery projects require a significant project management effort that needs to be grounded in a solid framework. Most knowledge discovery projects involve teamwork and thus need careful planning and scheduling. For most project management specialists, KDP and DM are not familiar terms. Therefore, these specialists need a definition of what such projects involve and how to carry them out in order to develop a sound project schedule.
• Knowledge discovery should follow the example of other engineering disciplines that already have established models. A good example is the software engineering field, which is a relatively new and dynamic discipline that exhibits many characteristics that are pertinent to knowledge discovery. Software engineering has adopted several development models, including the waterfall and spiral models that have become well-known standards in this area.
• There is a widely recognised need for standardisation of the KDP. The challenge for modern data miners is to come up with widely accepted standards that will stimulate major industry growth. Standardisation of the KDP model would allow the development of standardised methods and procedures, thereby enabling end users to deploy their projects more easily. It would lead directly to project performance that is faster, cheaper, more reliable, and more manageable. The standards would promote the development and delivery of solutions that use business terminology rather than the traditional language of algorithms, matrices, criteria, complexities, and the like, resulting in greater exposure and acceptability for the knowledge discovery field.

As there is some confusion about the terms data mining, knowledge discovery, and knowledge discovery in databases, we first need to define them. Note, however, that many researchers and practitioners use DM as a synonym for knowledge discovery; DM is also just one step of the KDP.

The knowledge discovery process (KDP), also called knowledge discovery in databases, seeks new knowledge in some application domain. It is defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

The process generalises to non-database sources of data, although it emphasizes databases as a primary source of data. It consists of many steps (one of them is DM), each attempting to complete a particular discovery task and each accomplished by the application of a discovery method. Knowledge discovery concerns the entire knowledge extraction process, including how data are stored and accessed, how to use efficient and scalable algorithms to analyze massive datasets, how to interpret and visualize the results, and how to model and support the interaction between human and machine. It also concerns support for learning and analyzing the application domain.

Fig. 4.2 Sequential structure of the KDP model: input data (database, images, video, semi-structured data, etc.) passes through Step 1 to Step n to produce knowledge (patterns, rules, clusters, classification, association, etc.) (Source: www.springer.com/cda/content/)

4.4.1 Overview of Knowledge Discovery Process The KDP model consists of a set of processing steps to be followed by practitioners when executing a knowledge discovery project. The model describes procedures that are performed in each of its steps. It is primarily used to plan, work through, and reduce the cost of any given project.

Since the 1990s, various KDPs have been developed. The initial efforts were led by academic research but were quickly followed by industry. The first basic structure of the model was proposed by Fayyad et al. and later improved/modified by others. The process consists of multiple steps that are executed in a sequence. Each subsequent step is initiated upon successful completion of the previous step, and requires the result generated by the previous step as its input. Another common feature of the proposed models is the range of activities covered, which stretches from the task of understanding the project domain and data, through data preparation and analysis, to evaluation, understanding, and application of the generated results.

All the proposed models also emphasise the iterative nature of the model, in terms of many feedback loops that are triggered by a revision process. A schematic diagram is shown in the above figure. The main differences between the models described here are found in the number and scope of their specific steps. A common feature of all models is the definition of inputs and outputs. Typical inputs include data in various formats, such as numerical and nominal data stored in databases or flat files; images; video; semi-structured data, such as XML or HTML; and so on. The output is the generated new knowledge — usually described in terms of rules, patterns, classification models, associations, trends, statistical analysis and so on.

Fig. 4.3 Relative effort [%] spent on specific steps of the KD process, comparing the estimates of Cabena et al., Shearer, and Cios and Kurgan across the KDDM steps (understanding of the domain, understanding of the data, preparation of the data, data mining, evaluation of the results, and deployment of the results) (Source: www.springer.com/cda/content/)

Most models follow a similar sequence of steps, while the common steps between the five are domain understanding, data mining, and evaluation of the discovered knowledge. The nine-step model carries out the steps concerning the choice of DM tasks and algorithms late in the process. The other models do so before preprocessing of the data in order to obtain data that are correctly prepared for the DM step without having to repeat some of the earlier steps. In the case of Fayyad’s model, the prepared data may not be suitable for the tool of choice, and thus a loop back to the second, third, or fourth step may be needed. The five-step model is very similar to the six-step models, except that it omits the data understanding step. The eight-step model gives a very detailed breakdown of steps in the initial phases of the KDP, but it does not allow for a step concerned with applying the discovered knowledge. Simultaneously, it recognizes the important issue of human resource identification.

A very important aspect of the KDP is the relative time spent in completing each of the steps. Evaluation of this effort allows precise scheduling. Various estimates have been proposed by researchers and practitioners alike. Figure 4.3 shows a comparison of these different estimates. We note that the numbers given are only estimates, which are used to quantify relative effort, and their sum may not equal 100%. The specific estimated values depend on many factors, such as existing knowledge about the considered project domain, the skill level of human resources, and the complexity of the problem at hand, to name just a few. The common theme of all estimates is an acknowledgment that the data preparation step is by far the most time-consuming part of the KDP.

The following summarises the steps of the knowledge discovery process:
• Define the problem: The first step involves understanding the problem and figuring out what the goals and expectations of the project are.
• Collect, clean, and prepare the data: This requires figuring out what data are needed and which data are most important, and integrating the information. This step needs considerable effort, as much as 70% of the total data mining effort.
• Data mining: This model-building step includes selecting data mining tools, transforming the data if the tool needs it, generating samples for training and testing the model, and finally using the tools to build and select a model.
• Validate the models: Test the model to make sure that it is producing accurate and adequate results (a minimal sketch of the model-building and validation steps is given after this list).
• Monitor the model: Monitoring a model is essential, as with passing time it will be necessary to revalidate the model to make sure that it is still meeting requirements. A model that works today may not work tomorrow; therefore, it is necessary to monitor the behaviour of the model to ensure it is meeting performance standards.
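A minimal sketch of the "data mining" and "validate the models" steps is shown below using scikit-learn and a synthetic dataset. The model choice, dataset and split parameters are illustrative assumptions, not something prescribed by the process model itself.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for prepared (cleaned, transformed) data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Generate samples for training and testing the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Model-building step: fit a candidate model on the training sample.
model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Validation step: check that the model produces adequate results on held-out data.
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))

In practice the validation figure would be tracked over time as well, which is exactly the monitoring step described in the last bullet.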

4.5 Web Mining With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to use automated tools in order to find the desired information resources, and to track and analyse their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge. Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. This describes the automatic search of information resources available online, that is, Web content mining, and the discovery of user access patterns from Web servers, that is, Web usage mining.

Web Mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web. There are roughly three knowledge discovery domains that pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining. Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web document text mining, resource discovery based on concept indexing or agent-based technology may also fall in this category. Web structure mining is the process of inferring knowledge from the World Wide Web organisation and links between references and referents in the Web. Finally, web usage mining, also known as Web Log Mining, is the process of extracting interesting patterns in web access logs.

Fig. 4.4 Web mining architecture: in stage 1, server log data, registration data and documents/usage data pass through data cleaning, transaction identification, integration and transformation to yield clean, integrated and formatted transaction data; in stage 2, pattern discovery (OLAP and path analysis, association rules, sequential patterns, clustering and classification, intelligent agents) and pattern analysis (visualisation tools, a knowledge query mechanism, database query language) are applied (Source: http://www.galeas.de/webmining.html)

4.5.1 Web Analysis All visitors to a web site leave digital trails, which servers automatically store in log files. Web analysis tools analyse and process these web server log files to produce meaningful information. Essentially, a complete profile of site traffic is created, for example, how many visitors come to the site, what sites they have come from, and which of the site's pages are most popular.
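The kind of traffic profile described above can be produced with a short Python script over a server log. The log format assumed below (a "combined" log line with a referrer field) and the file name access.log are illustrative assumptions; real logs may need a different pattern.

import re
from collections import Counter

# Typical combined log line, e.g.:
# 203.0.113.9 - - [10/Oct/2011:13:55:36 +0000] "GET /products.html HTTP/1.1" 200 2326 "http://example.org/" "Mozilla/..."
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)[^"]*" \d+ \S+ "([^"]*)"')

visitors, pages, referrers = set(), Counter(), Counter()
with open("access.log") as log:              # hypothetical log file
    for line in log:
        m = LINE.match(line)
        if not m:
            continue
        ip, path, referrer = m.groups()
        visitors.add(ip)                     # how many distinct visitors
        pages[path] += 1                     # which pages are most popular
        if referrer not in ("", "-"):
            referrers[referrer] += 1         # what sites they have come from

print("distinct visitors:", len(visitors))
print("top pages:", pages.most_common(5))
print("top referrers:", referrers.most_common(5))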

Web analysis tools provide companies with previously unknown statistics and helpful insights into the behaviour of their on-line customers. While the usage and popularity of such tools continues to increase, many e-tailers are now demanding more useful information on their customers from the vast amounts of data generated by their web sites.

The result of the changing paradigm of commerce, from traditional brick and mortar shop fronts to electronic transactions over the Internet, has been the dramatic shift in the relationship between e-tailers and their customers. There is no longer any personal contact between retailers and customers. Customers are now highly mobile and are demonstrating loyalty only to value, often irrespective of brand or geography. A major challenge for e-tailers is to identify and understand their new customer base. E-tailers need to learn as much as possible regarding the behaviour, the individual tastes and the preferences of the visitors to their sites in order to remain competitive in this new era of commerce.

4.5.2 Benefits of Web mining Web Mining allows e-tailers to leverage their on-line customer data by understanding and predicting the behaviour of their customers. For the first time, e-tailers now have access to detailed marketing intelligence on the visitors to their web sites. The business benefits that web mining affords to digital service providers include enhanced customer support, personalisation, collaborative filtering, product and service strategy definition, particle marketing and fraud detection. In short, it provides the ability to understand their customers' needs and to deliver the best and most appropriate service to those individual customers at any given moment.

4.6 Web Content Mining Web content mining, also known as text mining, is generally the second step in Web data mining. Content mining is the scanning and mining of text, pictures and graphs of a Web page to determine the relevance of the content to the search query. This scanning is completed after the clustering of web pages through structure mining and provides the results based upon the level of relevance to the suggested query. With the massive amount of information that is available on the World Wide Web, content mining provides the results lists to search engines in order of highest relevance to the keywords in the query.
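A very small sketch of the relevance-ranking idea behind content mining follows: score each page's text by how often the query keywords occur in it and return pages from highest to lowest relevance. The page texts, URLs and query are hypothetical, and the term-frequency score stands in for the more elaborate scoring real search engines use.

from collections import Counter

# Hypothetical crawled page texts keyed by URL.
pages = {
    "http://example.org/mining":  "data mining techniques for web content mining",
    "http://example.org/recipes": "quick dinner recipes and cooking tips",
    "http://example.org/kdd":     "knowledge discovery and data mining tutorial",
}

def relevance(text, query_terms):
    # Simple term-frequency score of a page against the query keywords.
    words = Counter(text.lower().split())
    return sum(words[t] for t in query_terms)

query = ["data", "mining"]
ranked = sorted(pages, key=lambda url: relevance(pages[url], query), reverse=True)
for url in ranked:
    print(relevance(pages[url], query), url)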

Text mining is directed toward the specific information provided by the customer's search terms in search engines. This allows for the scanning of the entire Web to retrieve the cluster content, triggering the scanning of specific Web pages within those clusters. The results are pages relayed to the search engines from the highest level of relevance to the lowest. Though search engines have the ability to provide links to Web pages by the thousands in relation to the search content, this type of web mining enables the reduction of irrelevant information.

Web text mining is very useful when used in relation to a content database dealing with specific topics. For example, online universities use a library system to recall articles related to their general areas of study. This specific content database makes it possible to pull only the information within those subjects, providing the most specific results of search queries in search engines. Providing only the most relevant information gives a higher quality of results. This productivity increase is due directly to the use of content mining of text and visuals.

The main uses of this type of data mining are to gather, categorise, organise and provide the best possible information available on the WWW to the user requesting the information. This tool is imperative to scanning the many HTML documents, images, and text provided on Web pages. The resulting information is provided to the search engines in order of relevance giving more productive results of each search.

Web content categorisation with a content database is the most significant tool for the efficient use of search engines. A customer requesting information on a particular subject or item would otherwise have to search through thousands of results to find the most relevant information to his query. This step, through the use of text mining, reduces those thousands of results. This eliminates the frustration and improves the navigation of information on the Web.


Business uses of content mining allow the information provided on their sites to be structured in a relevance-ordered site map. This enables a customer of the Web site to access specific information without having to search the entire site. With the use of this type of mining, data remains available in order of relevance to the query, thus providing productive marketing. Used as a marketing tool, this provides additional traffic to the Web pages of a company's site based on the amount of keyword relevance the pages offer to general searches. As the second section of data mining, text mining is useful to enhance the productive uses of mining for businesses, Web designers, and search engine operations. Organisation, categorisation, and gathering of the information provided by the WWW become easier and produce results that are more productive through the use of this type of mining.

In short, the ability to conduct Web content mining enables search engine results to direct the flow of customer clicks to a Web site, or to particular Web pages of the site, according to their relevance to search queries. The clustering and organisation of Web content in a content database enable effective navigation of the pages by the customer and by search engines. Images, content, formats and Web structure are examined to produce higher-quality information for the user based upon the requests made. Businesses can use this text mining to enhance the marketing of their sites as well as of the products they offer.

4.7 Web Structure Mining Web structure mining, one of three categories of web mining for data, is a tool used to identify the relationship between Web pages linked by information or direct link connection. This structure data is discoverable by the provision of a web structure schema through database techniques for Web pages. This connection allows a search engine to pull data relating to a search query directly to the linking Web page from the Web site the content rests upon. This is completed through the use of spiders scanning the Web sites, retrieving the home page, then linking the information through reference links to bring forth the specific page containing the desired information.

Structure mining addresses two main problems of the World Wide Web arising from its vast amount of information. The first of these problems is irrelevant search results. The relevance of search information becomes misconstrued because search engines often allow only low-precision criteria. The second problem is the inability to index the vast amount of information provided on the Web. This causes a low amount of recall with content mining. These problems are reduced in part by discovering the model underlying the Web hyperlink structure, which is what Web structure mining provides.
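One common way to model the underlying hyperlink structure is as a directed graph and to score pages by iterating over their in-links, the idea behind PageRank-style ranking. The link graph and page names below are hypothetical; this is a sketch of the general technique, not the specific method described in the text.

# Hypothetical hyperlink graph: page -> pages it links to.
links = {
    "home":     ["products", "about"],
    "products": ["item1", "item2"],
    "about":    ["home"],
    "item1":    ["products"],
    "item2":    ["products", "home"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}
damping = 0.85

# Power iteration: each page's score is fed by the scores of pages linking to it.
for _ in range(50):
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for src, outs in links.items():
        share = damping * rank[src] / len(outs)
        for dst in outs:
            new_rank[dst] += share
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print("%-8s %.3f" % (page, score))

Pages with many (and highly ranked) in-links end up with the highest scores, which is exactly the structural information a search engine uses alongside content relevance.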

The main purpose of structure mining is to extract previously unknown relationships between Web pages. This structure data mining offers a way for a business to link the information of its own Web site to enable navigation and to cluster information into site maps. This allows its users the ability to access the desired information through keyword association and content mining. The hyperlink hierarchy is also determined to path the related information within the sites to the relationship of competitor links and connections through search engines and third-party co-links. This enables clustering of connected Web pages to establish the relationship of these pages. On the WWW, the use of structure mining enables the determination of similar structure of Web pages by clustering through the identification of underlying structure. This information can be used to project the similarities of web content. The known similarities then provide the ability to maintain or improve the information of a site to enable access by web spiders in a higher ratio. The larger the number of Web crawlers, the more beneficial to the site because of related content to searches.

In the business world, structure mining can be quite useful in determining the connection between two or more business Web sites. The determined connection brings forth an effective tool for mapping competing companies through third-party links such as resellers and customers. This cluster map allows the content of the business pages to be placed in search engine results through the connection of keywords and co-links across the related Web pages. This determined information will provide the proper path through structure mining to enhance navigation of these pages through their relationships and the link hierarchy of the Web sites.

With improved navigation of Web pages on business Web sites, connecting the requested information to a search engine becomes more effective. This stronger connection allows traffic generated to a business site to provide results that are more productive. The more links provided within the relationship of the Web pages, the more easily navigation yields the link hierarchy, allowing ease of navigation. This improved navigation attracts the spiders to the correct locations providing the requested information, proving more beneficial in clicks to a particular site.

Therefore, Web mining and the use of structure mining can provide strategic results for the marketing of a Web site for the production of sales. The more traffic directed to the Web pages of a particular site, the higher the level of return visitation to the site and recall by search engines relating to the information or product provided by the company. This also enables marketing strategies to provide results that are more productive through navigation of the pages linking to the homepage of the site itself. Structure mining is a must in order to truly utilise your website as a business tool.

4.8 Web Usage Mining Web usage mining is the third category in web mining. This type of web mining allows for the collection of Web access information for Web pages. This usage data provides the paths leading to accessed Web pages. This information is often gathered automatically into access logs via the Web server. CGI scripts offer other useful information such as referrer logs, user subscription information and survey logs. This category is important to the overall use of data mining for companies and their internet/intranet-based applications and information access.
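Recovering the paths that lead to accessed pages usually starts by grouping log entries into sessions, for example by IP address with a 30-minute inactivity timeout. The click records below are hypothetical and the timeout value is a common convention rather than something fixed by the text; the sketch only illustrates the idea of sessionisation and path extraction.

from datetime import datetime, timedelta

# Hypothetical (ip, timestamp, page) click records, already sorted by time.
clicks = [
    ("10.0.0.1", datetime(2011, 9, 12, 10, 0), "/home"),
    ("10.0.0.1", datetime(2011, 9, 12, 10, 5), "/products"),
    ("10.0.0.2", datetime(2011, 9, 12, 10, 7), "/home"),
    ("10.0.0.1", datetime(2011, 9, 12, 11, 30), "/home"),   # new session: gap > 30 min
]
TIMEOUT = timedelta(minutes=30)

sessions = {}        # ip -> list of sessions, each a list of pages (a navigation path)
last_seen = {}       # ip -> timestamp of the previous click
for ip, ts, page in clicks:
    if ip not in last_seen or ts - last_seen[ip] > TIMEOUT:
        sessions.setdefault(ip, []).append([])              # start a new session
    sessions[ip][-1].append(page)
    last_seen[ip] = ts

for ip, user_sessions in sessions.items():
    for path in user_sessions:
        print(ip, ":", " -> ".join(path))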

Usage mining allows companies to produce productive information pertaining to the future of their business functionality. Some of this information can be derived from the collective information on lifetime user value, product cross-marketing strategies and promotional campaign effectiveness. The usage data that is gathered provides companies with the ability to produce results that are more effective for their businesses and increase sales. Usage data can also be effective for developing marketing skills that will outsell the competitors and promote the company's services or products at a higher level.

Usage mining is valuable not only to businesses using online marketing, but also to e-businesses whose business is based solely on the traffic provided through search engines. The use of this type of web mining helps to gather the important information from customers visiting the site. This enables an in-depth log for a complete analysis of a company's productivity flow. E-businesses depend on this information to direct the company to the most effective Web server for the promotion of their product or service.

This web mining also enables Web based businesses to provide the best access routes to services or other advertisements. When a company advertises for services provided by other companies, the usage mining data allows for the most effective access paths to these portals. In addition, there are typically three main uses for mining in this fashion.

The first is usage processing, used to complete pattern discovery. This use is also the most difficult because only bits of information like IP addresses, user information, and site clicks are available. With this minimal amount of information available, it is harder to track the user through a site, since the log does not follow the user throughout the pages of the site.

The second use is content processing, consisting of the conversion of Web information like text, images, scripts and others into useful forms. This helps with the clustering and categorization of Web page information based on the titles, specific content and images available.

Finally, the third use is structure processing. This consists of analysis of the structure of each page contained in a Web site. This structure processing can prove difficult if a new structure analysis has to be performed for each page.


Analysis of this usage data will provide companies with the information needed to provide an effective presence to their customers. This collection of information may include user registration, access logs and information leading to better Web site structure, proving to be most valuable to a company's online marketing. These present some of the advantages for external marketing of the company's products, services and overall management.

Internally, usage mining effectively provides information for the improvement of communication through intranet communications. Developing strategies through this type of mining will allow intranet-based company databases to be more effective through the provision of easier access paths. The projection of these paths helps to log user registration information, bringing commonly used paths to the forefront for access.

Therefore, it is easily determined that usage mining has valuable uses for the marketing of businesses and a direct impact on the success of their promotional strategies and internet traffic. This information is gathered regularly and continues to be analysed consistently. Analysis of this pertinent information will help companies to develop more useful promotions, better internet accessibility, improved inter-company communication and structure, and productive marketing skills through web usage mining.

Summary
• The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.
• Data mining is typically carried out with some end goals or applications. Broadly speaking, these goals fall into the following classes: prediction, identification, classification, and optimisation.
• The term "knowledge" is very broadly interpreted as involving some degree of intelligence.
• Deductive knowledge deduces new information based on applying pre-specified logical rules of deduction on the given data.
• The knowledge discovery process defines a sequence of steps (with eventual feedback loops) that should be followed to discover knowledge in data.
• The KDP model consists of a set of processing steps to be followed by practitioners when executing a knowledge discovery project.
• Web Mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web.
• Web analysis tools analyse and process web server log files to produce meaningful information.
• Web analysis tools provide companies with previously unknown statistics and useful insights into the behaviour of their on-line customers.
• Content mining is the scanning and mining of text, pictures and graphs of a Web page to determine the relevance of the content to the search query.
• Web structure mining, one of three categories of web mining for data, is a tool used to identify the relationship between Web pages linked by information or direct link connection.
• Web usage mining allows for the collection of Web access information for Web pages.

References
• Zaptron, 1999. Introduction to Knowledge-based Knowledge Discovery [Online] Available at: . [Accessed 9 September 2011].
• Maimom, O. and Rokach, L., Introduction To Knowledge Discovery In Database [Online PDF] Available at: . [Accessed 9 September 2011].
• Maimom, O. and Rokach, L., 2005. Data mining and knowledge discovery handbook, Springer Science and Business.
• Liu, B., 2007. Web data mining: exploring hyperlinks, contents, and usage data, Springer.
• http://nptel.iitm.ac.in, 2008. Lecture - 34 Data Mining and Knowledge Discovery [Video Online] Available at: <http://www.youtube.com/watch?v=m5c27rQtD2E>. [Accessed 12 September 2011].
• http://nptel.iitm.ac.in, 2008. Lecture - 35 Data Mining and Knowledge Discovery Part II [Video Online] Available at: . [Accessed 12 September 2011].

Recommended Reading
• Scime, A., 2005. Web mining: applications and techniques, Idea Group Inc (IGI).
• Han, J. and Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann.
• Chang, G., 2001. Mining the World Wide Web: an information search approach, Springer.


Self Assessment 1. During______, data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected. a. data selection b. data extraction c. data mining d. data warehousing

2. ______and encoding may be done to reduce the amount of data. a. Data selection b. Data extraction c. Data transformation d. Data mining

3. Which of the following is defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data? a. Knowledge discovery process b. Data mining c. Data warehousing d. Web mining

4. Match the columns

Column A:
A. correlate the presence of a set of items with another range of values for another set of variables
B. work from an existing set of events or transactions to create a hierarchy of classes
C. a sequence of actions or events is sought
D. a given population of events or items can be partitioned (segmented) into sets of "similar" elements
Column B:
1. Association rules
2. Classification hierarchies
3. Sequential patterns
4. Clustering

a. 1-A, 2-B, 3-C, 4-D b. 1-D, 2-A, 3-C, 4-B c. 1-B, 2-A, 3-C, 4-D d. 1-C, 2-B, 3-A, 4-D

5. Which of the following requires a significant project management effort that needs to be grounded in a solid framework? a. Knowledge discovery process b. Knowledge discovery in database c. Knowledge discovery project d. Knowledge discovery

6. ______ is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web. a. Data extraction b. Data mining c. Web mining d. Knowledge discovery process

7. ______tools provide companies with previously unknown statistics and useful insights into the behaviour of their on-line customers. a. Web analysis b. Web mining c. Data mining d. Web usage mining

8. What enables e-tailers to leverage their on-line customer data by understanding and predicting the behaviour of their customers? a. Web analysis b. Web mining c. Data mining d. Web usage mining

9. Which of the following is known as text mining? a. Web mining b. Web usage mining c. Web content mining d. Web structure mining

10. What is used to identify the relationship between Web pages linked by information or direct link connection? a. Web mining b. Web usage mining c. Web content mining d. Web structure mining


Chapter V Advanced Topics of Data Mining

Aim

The aim of this chapter is to:

• introduce the concept of spatial data mining and knowledge discovery

• analyse different techniques of spatial data mining and knowledge discovery

• explore the sequence mining

Objectives

The objectives of this chapter are to:

• explicate the cloud model

• describe temporal mining

• elucidate database mediators

Learning outcome

At the end of this chapter, you will be able to:

• enlist types of temporal data

• comprehend temporal data processing

• understand the classification techniques

5.1 Introduction The technical progress in computerised data acquisition and storage has resulted in the growth of vast databases. With this continuous increase and accumulation, the huge amounts of computerised data have far exceeded the human ability to completely interpret and use them. These phenomena are even more serious in geo-spatial science. In order to understand and make full use of these data repositories, a few techniques have been tried, for instance, expert systems, database management systems, spatial data analysis, machine learning, and artificial intelligence. In 1989, knowledge discovery in databases was proposed. Later, in 1995, data mining also appeared. As both data mining and knowledge discovery in databases virtually point to the same techniques, people often refer to them together, that is, as data mining and knowledge discovery (DMKD). As 80% of data are geo-referenced, this necessity forces people to consider spatial characteristics in DMKD and to further develop a branch in geo-spatial science, that is, SDMKD.

Spatial data are more complex, more changeable and larger than common affair datasets. The spatial dimension means each item of data has a spatial reference, where each entity occurs on a continuous surface, or where a spatially referenced relationship exists between two neighbouring entities. Spatial data contain not only positional data and attribute data, but also spatial relationships among spatial entities. Moreover, spatial data structure is more complex than the tables in an ordinary relational database. Besides tabular data, there are vector and raster graphic data in a spatial database. Moreover, the features of graphic data are not explicitly stored in the database. At the same time, contemporary GIS have only basic analysis functionalities, the results of which are explicit. And it is under the assumption of dependency and on the basis of the sampled data that geostatistics estimates values at unsampled locations or makes a map of the attribute. Because the discovered spatial knowledge can support and improve spatial data-referenced decision-making, growing attention has been paid to the study, development and application of SDMKD.

5.2 Concepts Spatial data mining and knowledge discovery (SDMKD) is an efficient extraction of hidden, implicit, interesting, previously unknown, potentially useful, ultimately understandable, spatial or non-spatial knowledge (rules, regularities, patterns, constraints) from incomplete, noisy, fuzzy, random and practical data in large spatial databases. It is a confluence of databases technology, artificial intelligence, machine learning, probabilistic statistics, visualisation, information science, pattern recognition and other disciplines. Understood from different viewpoints (Table 5.1), SDMKD shows many new interdisciplinary characteristics.

• Discipline: An interdisciplinary subject whose theories and techniques are linked with databases, computing, statistics, cognitive science, artificial intelligence, mathematics, machine learning, networks, data mining, knowledge discovery in databases, data analysis, pattern recognition, and so on.
• Analysis: Discover unknown and useful rules from huge amounts of data via sets of interactive, repetitive, associative, and data-oriented manipulations.
• Logic: An advanced technique of deductive spatial reasoning. It is discovery, not proof. The knowledge is conditionally generic on the mined data.
• Cognitive science: An inductive process that goes from concrete data to abstract patterns, from special phenomena to general rules.
• Objective data: Data forms: vector, raster, and vector-raster. Data structures: hierarchical, relational, network, and object-oriented. Spatial and non-spatial data contents: positions, attributes, texts, images, graphics, databases, file systems, log files, voices, web and multimedia.
• Systematic information: Original data in databases, cleaned data in data warehouses, senior commands from users, background knowledge from applicable fields.
• Methodology: Matches the multidisciplinary philosophy of human thinking that suitably deals with the complexity, uncertainty, and variety when briefing data and representing rules.
• Application: All spatial data-referenced fields and decision-making processes, for example, GIS, remote sensing, GPS (global positioning system), transportation, police, medicine, navigation, robotics, and so on.

Table 5.1 Spatial data mining and knowledge discovery in various viewpoints

5.2.1 Mechanism SDMKD is a process of discovering a form of rules along with exceptions at hierarchical view-angles with various thresholds, for instance, drilling, dicing and pivoting on multidimensional databases, spatial data warehousing, generalising, characterising and classifying entities, summarising and contrasting data characteristics, describing rules, predicting future trends and so on. It is also a process that supports spatial decision-making. There are two mining granularities, namely spatial object granularity and pixel granularity.

The process may be briefly partitioned into three main steps: data preparation (positioning the mining objective, collecting background knowledge, cleaning spatial data), data mining (reducing data dimensions, selecting mining techniques, discovering knowledge), and knowledge application (interpretation, evaluation and application of the discovered knowledge). In order to discover confident knowledge, it is common to use more than one technique to mine the data sets simultaneously. Moreover, it is also advisable to select the mining techniques on the basis of the given mining task and the knowledge to be discovered.

5.2.2 Knowledge to be Discovered
Knowledge is more generalised, condensed and understandable than data. Common knowledge is summarised and generalised from huge amounts of spatial data sets. The amount of spatial data is huge, while the volume of spatial rules is very small; the more generalised the knowledge, the bigger this contrast. There are many kinds of knowledge that can be mined from large spatial data sets. As Table 5.2 shows, these kinds of rules are not isolated, and they often benefit from each other. Various forms, such as linguistic concepts, characteristic tables, predicate logic, semantic networks, object orientation, and visualisation, can represent the discovered knowledge. Very complex nonlinear knowledge may be depicted with a group of rules.

• Association rule: A logical association among different sets of spatial entities that associates one or more spatial objects with other spatial objects; studies the frequency of items occurring together in transactional databases. Example: Rain (x, pour) => Landslide (x, happen), with interestingness 98%, support 76%, and confidence 51%.
• Characteristics rule: A common character of one kind of spatial entity, or of several kinds of spatial entities; a kind of tested knowledge for summarising similar features of objects in a target class. Example: characterise similar ground objects in a large set of remote sensing images.
• Discriminate rule: A special rule that tells one spatial entity from another; differing spatial characteristics rules; a comparison of the general features of objects between a target class and a contrasting class. Example: compare land price at the urban boundary and land price in the urban centre.
• Clustering rule: A segmentation rule that groups a set of objects together by virtue of their similarity or proximity to each other, in contexts where it is unknown what groups, and how many groups, will be clustered; organises data into unsupervised clusters based on attribute values. Example: group crime locations to find distribution patterns.
• Classification rule: A rule that defines whether a spatial entity belongs to a particular class or set, in contexts where it is known what classes, and how many classes, will be classified; organises data into given/supervised classes based on attribute values. Example: classify remotely sensed images based on spectrum and GIS data.
• Serial rules: Spatiotemporally constrained rules that relate spatial entities continuously in time, or the functional dependency among parameters; analyse trends, deviations, regression, sequential patterns, and similar sequences. Examples: in summer, landslide disasters often happen; land price is a function of influential factors and time.
• Predictive rule: An inner trend that forecasts the future values of some spatial variables when the temporal or spatial centre is moved to another one; predicts unknown or missing attribute values based on other seasonal or periodical information. Example: forecast the movement trend of a landslide based on available monitoring data.
• Exceptions: Outliers that are isolated from the common rules or deviate very much from other data observations. Example: a monitoring point with much bigger movement.

Table 5.2 Main spatial knowledge to be discovered

Knowledge is a combination of rules and exceptions. A spatial rule is a pattern showing the intersection of two or more spatial objects or space-dependent attributes according to a particular spacing or set of arrangements (Ester, 2000). Besides the rules, during the discovery process of description or prediction, there may be some exceptions (also named outliers) that deviate very much from the other data observations. They identify and explain exceptions (surprises). For example, spatial trend predictive modelling first discovers the centres that are local maxima of some non-spatial attribute, then determines the (theoretical) trend of that attribute when moving away from the centres. Finally, a few deviations are found, that is, some data lie away from the theoretical trend. These deviations may arouse the suspicion that they are noise, or that they are generated by a different mechanism. How can these outliers be explained? Traditionally, outlier detection has been studied via statistics, and a number of discordance tests have been developed. Most of them treat outliers as "noise" and try to eliminate the effects of outliers by removing them or by developing outlier-resistant methods. In fact, these outliers prove the rules. In the context of data mining, they are meaningful input signals rather than noise. In some cases, outliers represent unique characteristics of the objects that are important to an organisation. Therefore, a piece of generic knowledge is virtually in the form of a rule plus exceptions.

5.3 Techniques of SDMKD
As SDMKD is an interdisciplinary subject, there are various techniques associated with the different kinds of knowledge mentioned above. They may include probability theory, evidence theory, spatial statistics, fuzzy sets, the cloud model, rough sets, neural networks, genetic algorithms, decision trees, exploratory learning, inductive learning, visualisation, spatial online analytical mining (SOLAM), outlier detection, and so on. The main techniques are briefly described in Table 5.3.


• Probability theory: Mine spatial data with randomness on the basis of stochastic probabilities. The knowledge is represented as a conditional probability in the context of given conditions and a certain hypothesis of truth. Also named probability theory and mathematical statistics.
• Spatial statistics: Discover sequential geometric rules from disordered data via the covariance structure and variation function, in the context of adequate samples and background knowledge. Clustering analysis is a branch.
• Evidence theory: Mine spatial data via belief functions and possibility functions. It is an extension of probability theory and is suitable for SDMKD based on stochastic uncertainty.
• Fuzzy sets: Mine spatial data with fuzziness on the basis of a fuzzy membership function that depicts an inaccurate probability, using fuzzy comprehensive evaluation, fuzzy clustering analysis, fuzzy control, fuzzy pattern recognition and so on.
• Rough sets: Mine spatial data with incomplete uncertainties via a pair of lower and upper approximations. Rough-sets-based SDMKD is also a process of intelligent decision-making under the umbrella of spatial data.
• Neural network: Mine spatial data via a nonlinear, self-learning, self-adaptive, parallel, and dynamic system composed of many linked neurons in a network. The set of neurons collectively finds out rules by continuously learning from and training on samples in the network.
• Genetic algorithms: Search for the optimised rules from spatial data via three operators simulating the replication, crossover, and aberrance (mutation) of biological evolution.
• Decision tree: Reason out the rules by rolling down and drilling up a tree-structured map, of which the root node is the mining task, the item and branch nodes are the mining process, and the leaf nodes are exact data sets. After pruning, the hierarchical patterns are uncovered.
• Exploratory learning: Focus on data characteristics by analysing topological relationships, overlaying map layers, matching images, buffering features (points, lines, polygons) and optimising routes. Comes from machine learning.
• Spatial inductive learning: Summarise and generalise spatial data in the context of a given background that comes from users or from a task of SDMKD. The algorithms require that the training data be composed of several tuples with various attributes, one of which is the class label.
• Visualisation: Visually mine spatial data by computerised visualisation techniques that turn abstract data and complicated algorithms into concrete graphics, images, animation and so on, which the user can perceive directly.
• SOLAM (spatial online analytical mining): Mine data via online analytical processing and a spatial data warehouse, based on a multidimensional view and the web. It is a tested form of mining that highlights execution efficiency and timely response to commands.
• Outlier detection: Extract the interesting exceptions, besides the common rules, from spatial data via statistics, clustering, classification, and regression.

Table 5.3 Techniques to be used in SDMKD

5.3.1 SDMKD-based Image Classification
This section presents an approach that combines spatial inductive learning with Bayesian image classification in a loose manner. It takes the learning tuple as the mining granularity for learning knowledge to subdivide classes into subclasses, that is, pixel granularity and polygon granularity, and selects the class probability values of the Bayesian classification, shape features, locations and elevations as the learning attributes. GIS data are used in training area selection for the Bayesian classification, in generating learning data of the two granularities, and in test area selection for classification accuracy evaluation. The ground control points for image rectification are also chosen from GIS data. Inductive learning in spatial data mining is implemented via the C5.0 algorithm on the basis of the learning granularities. Figure 5.1 shows the principle of the method.

[Figure: the training area selected from the GIS database and the remote sensing images feed the Bayesian classification; the initial classification result is converted into pixel-granularity and polygon-granularity data; inductive learning at both granularities builds a knowledge base; deductive reasoning with this knowledge produces the final classification result, which is evaluated for accuracy against a test area.]

Fig. 5.1 Flow diagram of remote sensing image classification with inductive learning

In Figure 5.1, the remote sensing images are first classified by the Bayesian method before using the knowledge, and the probabilities of each pixel belonging to every class are retained. Secondly, inductive learning is conducted on the learning attributes. Learning with probability simultaneously makes use of the spectral information of a pixel and the statistical information of a class, since the probability values are derived from both of them. Thirdly, knowledge on the attributes of general geometric features, spatial distribution patterns and spatial relationships is further discovered from the GIS database, for instance, from the polygons of different classes. For example, the water areas in the classification image are converted from pixels to polygons by raster-to-vector conversion, and then the location and shape features of these polygons are calculated. Finally, the polygons are subdivided into subclasses by deductive reasoning based on the knowledge; for example, the class water is subdivided into subclasses such as river, lake, reservoir and pond.

As shown in Figure 5.1, the final classification results are obtained by post-processing the initial classification results with deductive reasoning. Except for the class label attribute, the attributes used for deductive reasoning are the same as those used in inductive learning. The knowledge discovered by the C5.0 algorithm is a group of classification rules and a default class, and each rule carries a confidence value between 0 and 1. According to how the rules are activated, that is, whether the attribute values match the conditions of a rule, the deductive reasoning adopts four strategies (a sketch follows this list):
• If only one rule is activated, then let the final class be the same as this rule.
• If several rules are activated, then let the final class be the same as the rule with the maximum confidence.
• If several rules are activated and the confidence values are the same, then let the final class be the same as the rule with the maximum coverage of learning samples.
• If no rule is activated, then let the final class be the default class.
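The following is a minimal Python sketch of these four rule-selection strategies. The rule representation (a list of condition functions plus a class label, confidence and coverage) is an assumption made for illustration; it is not the exact structure produced by the C5.0 algorithm.

def classify(sample, rules, default_class):
    """Pick a final class for `sample` using the four strategies above."""
    activated = [r for r in rules if all(cond(sample) for cond in r["conditions"])]
    if not activated:                      # strategy 4: no rule fires
        return default_class
    if len(activated) == 1:                # strategy 1: exactly one rule fires
        return activated[0]["class"]
    best_conf = max(r["confidence"] for r in activated)
    top = [r for r in activated if r["confidence"] == best_conf]
    if len(top) == 1:                      # strategy 2: highest confidence wins
        return top[0]["class"]
    # strategy 3: tie on confidence, so the largest coverage of learning samples wins
    return max(top, key=lambda r: r["coverage"])["class"]

# Hypothetical rules for a 'water' polygon described by area and elongation features
rules = [
    {"conditions": [lambda s: s["area"] > 5.0], "class": "lake", "confidence": 0.9, "coverage": 40},
    {"conditions": [lambda s: s["elongation"] > 3.0], "class": "river", "confidence": 0.9, "coverage": 55},
]
print(classify({"area": 6.2, "elongation": 4.1}, rules, default_class="pond"))  # -> 'river'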


5.3.2 Cloud Model
The cloud model is a model of the uncertainty transition between qualitative and quantitative analysis, that is, a mathematical model of the uncertainty transition between a linguistic term of a qualitative concept and its numerical representation data. A cloud is made up of many cloud drops, visible in shape as a whole but fuzzy in detail, which is similar to a natural cloud in the sky; hence the terminology cloud is used to name the uncertainty transition model proposed here. Any one of the cloud drops is a stochastic mapping in the discourse universe from a qualitative concept, that is, a specified realisation with uncertain factors. With the cloud model, the mapping from the discourse universe to the interval [0, 1] is a one-point to multi-point transition, producing a piece of cloud rather than a membership curve. As well, the degree to which any cloud drop represents the qualitative concept can be specified. The cloud model may mine spatial data with both fuzzy and stochastic uncertainties, and the discovered knowledge is close to human thinking. Recently, in geo-spatial science, the cloud model has been further applied to spatial intelligent query, image interpretation, land price discovery, factor selection, the mechanism of spatial data mining, and landslide monitoring.

The cloud model integrates fuzziness and randomness in a unified way via three numerical characteristics: the Expected value (Ex), the Entropy (En), and the Hyper-Entropy (He). In the discourse universe, Ex is the position corresponding to the centre of gravity of the cloud, whose elements are fully compatible with the spatial linguistic concept; En is a measure of the concept's coverage, that is, a measure of the spatial fuzziness, which indicates how many elements can be accepted into the spatial linguistic concept; and He is a measure of the dispersion of the cloud drops, which can also be considered the entropy of En. In the extreme case {Ex, 0, 0}, the cloud denotes the concept of a deterministic datum, where both the entropy and the hyper-entropy equal zero. The greater the number of cloud drops, the more deterministic the concept. Figure 5.2 shows the three numerical characteristics of the linguistic term "displacement is around 9 millimetres (mm)". Given the three numerical characteristics Ex, En and He, the cloud generator can produce as many drops of the cloud as desired.

[Figure: a normal cloud of drops for the linguistic term "displacement is around 9 mm", with the expected value Ex located at 9 mm on the x-axis, the span of the drops indicated by 3En, and the thickness of the cloud indicated by the hyper-entropy He.]

Fig. 5.2 Three numerical characteristics

The above three visualisation methods are all implemented with the forward cloud generator in the context of a given {Ex, En, He}. Despite the uncertainty in the algorithm, the positions of the cloud drops produced each time are deterministic: each cloud drop produced by the cloud generator is plotted deterministically according to its position. On the other hand, it is an elementary issue in spatial data mining that a spatial concept is always constructed from the given spatial data, and spatial data mining aims to discover spatial knowledge represented by a cloud from the database. That is, the backward cloud generator is also essential. It can be used to perform the transition from data to linguistic terms, and may recover the integrated characteristics {Ex, En, He} from cloud drops specified by many precise data points. Under the umbrella of mathematics, the normal cloud model is common, and the functional cloud model is more interesting. Because it is common and useful for representing spatial linguistic atoms, the normal compatibility cloud will be taken as an example to study the forward and backward cloud generators in the following.

The input of the forward normal cloud generator is the three numerical characteristics of a linguistic term, (Ex, En, He), and the number of cloud drops to be generated, N, while the output is the quantitative positions of the N cloud drops in the data space and the certainty degree with which each cloud drop can represent the linguistic term. The algorithm in detail is (a sketch follows the steps):
• Produce a normally distributed random number En' with mean En and standard deviation He.
• Produce a normally distributed random number xi with mean Ex and standard deviation En'.
• Calculate the certainty degree yi = exp(-(xi - Ex)² / (2En'²)).
• The drop (xi, yi) is a cloud drop in the discourse universe.
• Repeat the above steps until N cloud drops are generated.
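A minimal Python sketch of the forward normal cloud generator described above; the certainty degree follows the normal form given in the steps, and the parameter values in the example are illustrative only.

import math
import random

def forward_normal_cloud(Ex, En, He, N):
    """Generate N cloud drops (x_i, y_i) for the normal cloud model {Ex, En, He}."""
    drops = []
    for _ in range(N):
        En_i = random.gauss(En, He)                          # En' with mean En, std He
        x_i = random.gauss(Ex, abs(En_i))                    # drop position with mean Ex, std En'
        y_i = math.exp(-(x_i - Ex) ** 2 / (2 * En_i ** 2))   # certainty degree
        drops.append((x_i, y_i))
    return drops

# e.g. the linguistic term "displacement is around 9 mm" (En and He are illustrative)
drops = forward_normal_cloud(Ex=9.0, En=0.5, He=0.05, N=1000)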

Simultaneously, the input of the backward normal cloud generator is the quantitative positions of N cloud-drops, xi (i=1,…,N), and the certainty degree that each cloud-drop can represent a linguistic term, yi(i=1,…,N), while the output is the three numerical characteristics, Ex, En, He, of the linguistic term represented by the N cloud-drops.

The algorithm in detail is (a sketch follows the steps):
• Calculate the mean value of the xi (i = 1, …, N): Ex = (1/N) Σ xi.
• For each pair (xi, yi), calculate Eni = sqrt(-(xi - Ex)² / (2 ln yi)).
• Calculate the mean value of the Eni (i = 1, …, N): En = (1/N) Σ Eni.
• Calculate the standard deviation of the Eni: He.
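A matching Python sketch of the backward normal cloud generator; it assumes the drops come from a forward normal cloud, so the per-drop entropy can be recovered by inverting yi = exp(-(xi - Ex)² / (2Eni²)).

import math
import statistics

def backward_normal_cloud(drops):
    """Recover (Ex, En, He) from a list of (x_i, y_i) cloud drops."""
    xs = [x for x, _ in drops]
    Ex = statistics.mean(xs)
    # Invert the certainty-degree formula for each drop (drops with y outside (0, 1) are skipped)
    En_list = [abs(x - Ex) / math.sqrt(-2 * math.log(y)) for x, y in drops if 0 < y < 1]
    En = statistics.mean(En_list)
    He = statistics.stdev(En_list)
    return Ex, En, He

# Example: feed in drops produced by the forward generator sketched earlier, e.g.
# Ex, En, He = backward_normal_cloud(forward_normal_cloud(9.0, 0.5, 0.05, 5000))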

With the given algorithms for the forward and backward cloud generators, it is easy to build the inseparable and interdependent mapping relationship between a qualitative concept and quantitative data. The cloud model overcomes the weaknesses of rigid specification and excessive certainty, which conflict with the human recognition process and which appear in commonly used transition models. Moreover, it performs the interchangeable transition between qualitative concept and quantitative data through the use of strict mathematical functions, and the preservation of uncertainty in the transition makes the cloud model meet the needs of real-life situations well. Obviously, the cloud model is not a simple combination of probability methods and fuzzy methods.

5.3.3 Data Fields
The obtained spatial data are comparatively incomplete. Each datum in the concept space makes its own contribution to forming the concept and the concept hierarchy. Thus, it is necessary for the observed data to radiate their data energies from the sample space to their parent space. In order to describe this data radiation, the data field is proposed.

Spatial data radiate energies into a data field. The power of the data field may be measured by its potential with a field function. This is similar to the way electric charges form an electric field: every electric charge affects the electric potential everywhere in the field. Therefore, the function of a data field can be derived from physical fields. The potential at a point in the universe of discourse is the sum of all the data potentials.

Where k is a constant radiation factor, ri is the distance from the point to the position of the ith observed datum, ρi is the certainty of the ith datum, and N is the number of data. With a higher certainty, a datum makes a greater contribution to the potential in the concept space. Besides these, the spacing between neighbouring isopotential lines, the computerised grid density of the Cartesian coordinates, and so on may also contribute to the data field.


5.4 Design- and Model-based Approaches to Spatial Sampling
The design-based and model-based approaches to spatial sampling are explained below.

5.4.1 Design-based Approach to Sampling
The design-based approach, or classical sampling theory approach, to spatial sampling views the population of values in the region as a set of unknown values which are, apart from any measurement error, fixed in value. Randomness enters through the process for selecting the locations to sample. In the case of a discrete population, the target of inference is some global property such as:

z̄ = (1/N) Σ z(k), summed over k = 1, …, N   (eq. 1)

where N is the number of members of the population, so that (eq. 1) is the population mean. If z(k) is binary, depending on whether the kth member of the population is of a certain category or not, then (eq. 1) gives the population proportion with some specified attribute. In the case of a continuous population in a region A of area |A|, the sum would be replaced by the integral:

z̄ = (1/|A|) ∫A z(s) ds   (eq. 2)

The design-based approach is principally used for tackling 'how much' questions, such as estimating the above quantities. In principle, individual z(k) could be targets of inference but, because design-based estimators disregard most of the information that is available on where the samples are located in the study area, in practice this is either not possible or gives rise to estimators with poor properties.


Design-based sample with one observation per stratum. In the absence of spatial information, the point X in stratum (k, s) would have to be estimated using the other point in stratum (k, s), even though in fact the samples in two other strata are closer and may well provide better estimates.

Fig. 5.3 Using spatial information for estimation from a sample

5.4.2 Model-based Approach to Sampling
The model-based approach, or superpopulation approach, to spatial sampling views the population of values in the study region as but one realisation of some stochastic model. The source of randomness that is present in a sample derives from the stochastic model. Again, the target of inference could be (eq. 1) or (eq. 2).

Under the superpopulation approach, (eq. 1), for example, now represents the mean of just one realisation. Were other realisations to be generated, (eq. 1) would differ across realisations. Under this strategy, since (eq. 1) is a sum of random variables, it is itself a random variable, and it is usual to speak of predicting its value. A model-based sampling strategy provides predictors that depend on model properties and are optimal with respect to the selected model. Results may be dismissed if the model is subsequently rejected or disputed.

In the model-based approach it is the mean (μ) of the stochastic model assumed to have generated the realised population that is the usual target of inference rather than a quantity such as (eq.1). This model mean can be considered the underlying signal of which (eq.1) is a ‘noisy’ reflection. Since μ is a (fixed) parameter of the underlying stochastic model, if it is the target of inference, it is usual to speak of estimating its value. In spatial epidemiology for example, it is the true underlying relative risk for an area rather than the observed or realised relative risk revealed by the specific data set that is of interest. Another important target of inference within the model-based strategy is often z(i ) – the value of Z at the location i. Since Z is a random variable it is usual to speak of predicting the value z(i ).

5.5 Temporal Mining
A database system consists of three layers: physical, logical, and external. The physical layer deals with the storage of the data, while the logical layer deals with the modelling of the data. The external layer is the layer that the database user interacts with by submitting database queries. A database model depicts the way that the database management system stores the data and manages their relations. The most prevalent models are the relational and the object-oriented. For the relational model, the basic construct at the logical layer is the table, while for the object-oriented model it is the object. Because of its popularity, we will use the relational model in this book. Data are retrieved and manipulated in a relational database using SQL. A relational database is a collection of tables, also known as relations. The columns of a table correspond to attributes of the relational variable, while the rows, also known as tuples, correspond to the different values of the relational variable. Other frequently used database terms are the following:
• Constraint: A rule imposed on a table or a column.
• Trigger: The specification of a condition whose occurrence in the database causes the appearance of an external event, such as the appearance of a popup.
• View: A stored database query that hides rows and/or columns of a table.

Temporal databases are databases that contain time-stamping information. Time-stamping can be done as follows: • With a valid time, this is the time that the element information is true in the real world. • With a transaction time, this is the time that the element information is entered into the database. • Bi-temporally, with both a valid time and a transaction time.

Time-stamping is usually applied to each tuple; however, it can be applied to each attribute as well. Databases that support time can be divided into four categories: • Snapshot databases: They keep the most recent version of the data. Conventional databases fall into this category. • Rollback databases: They support only the concept of transaction time. • Historical databases: They support only valid time. • Temporal databases: They support both valid and transaction times.


The two types of temporal entities that can be stored in a database are: • Interval: A temporal entity with a beginning time and an ending time. • Event: A temporal entity with an occurrence time.

In addition to interval and event, another type of a temporal entity that can be stored in a database is a time series. A time series consists of a series of real valued measurements at regular intervals. Other frequently used terms related to temporal data are the following:

• Granularity: Describes the duration of the time sample/measurement.
• Anchored data: Can be used to describe either the time of occurrence of an event or the beginning and ending times of an interval.
• Unanchored data: Used to represent the duration of an interval.
• Data coalescing: The replacement of two tuples A and B with a single tuple C, where A and B have identical non-temporal attributes and adjacent or overlapping temporal intervals. C has the same non-temporal attributes as A and B, while its temporal interval is the union of A's and B's temporal intervals.

Table 5.4 Terms related to temporal data
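As a concrete illustration of data coalescing, the following Python sketch merges two tuples that have identical non-temporal attributes and adjacent or overlapping valid-time intervals; the tuple layout used here is an assumption for illustration.

def coalesce(a, b):
    """Merge tuples a and b if they have the same non-temporal attributes and
    adjacent or overlapping valid-time intervals; otherwise return None.
    Each tuple is represented here as (attributes_dict, (start, end))."""
    attrs_a, (s1, e1) = a
    attrs_b, (s2, e2) = b
    overlapping_or_adjacent = max(s1, s2) <= min(e1, e2)
    if attrs_a == attrs_b and overlapping_or_adjacent:
        return (attrs_a, (min(s1, s2), max(e1, e2)))
    return None

a = ({"patient": "P1", "ward": "ICU"}, (1, 5))
b = ({"patient": "P1", "ward": "ICU"}, (5, 9))
print(coalesce(a, b))   # ({'patient': 'P1', 'ward': 'ICU'}, (1, 9))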

5.5.1 Time in Data Warehouses
A data warehouse (DW) is a repository of data that can be used in support of business decisions. Many data warehouses have a time dimension and thus they support the idea of valid time. In addition, data warehouses contain snapshots of historical data and inherently support the idea of transaction time. Therefore, a DW can be considered as a temporal database, because it inherently contains bi-temporal time-stamping. Time affects the structure of the warehouse too. This is done by gradually increasing the granularity coarseness as we move further back in the time dimension of the data. Data warehouses, therefore, inherently support coalescing. Despite the fact that data warehouses inherently support the notion of time, they are not equipped to deal with temporal changes in master data.

5.5.2 Temporal Constraints and Temporal Relations
Temporal constraints can be either qualitative or quantitative. Regarding quantitative temporal constraints, variables take their values over the set of temporal entities and the constraints are imposed on a variable by restricting its set of possible values. In qualitative temporal constraints, variables take their value from a set of temporal relations.

5.5.3 Requirements for a Temporal Knowledge-Based Management System
Following is a list of requirements that must be fulfilled by a temporal knowledge-based management system:
• To be able to answer real-world queries, it must be able to handle large amounts of temporal data.
• It must be able to represent and answer queries about both quantitative and qualitative temporal relationships.
• It must be able to represent causality between temporal events.
• It must be able to distinguish between the history of an event and the time that the system learns about the event. In other words, it must be able to distinguish between valid and transaction times.
• It must offer an expressive query language that also allows updates.
• It must be able to express persistence in a parsimonious way. In other words, when an event happens, it should change only the parts of the system that are affected by the event.

5.6 Database Mediators
A temporal database mediator is used to discover temporal relations, implement temporal granularity conversion, and also discover semantic relationships. This mediator is a computational layer placed between the user interface and the database. The following figure shows the different layers of processing of the user query: "Find all patients who were admitted to the hospital in February." The query is submitted in natural language in the user interface; then, in the Temporal Mediator (TM) layer, it is converted to an SQL query. It is also the job of the Temporal Mediator to perform temporal reasoning to find the correct beginning and end dates for the SQL query.

[Figure: the user submits the natural-language request "Find all patients who were admitted in February" through the user interface (UI); the Temporal Mediator (TM) performs temporal reasoning and converts it to an SQL query, which is executed against the patient records.]

Fig. 5.4 Different layers of user query processing
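A minimal Python sketch of the temporal-reasoning step performed by such a mediator: it resolves the phrase "February" to concrete beginning and end dates and builds a parameterised SQL query. The table and column names (patients, admission_date) are hypothetical.

import calendar
from datetime import date

def month_bounds(month_name, year):
    """Resolve a month name to its first and last calendar dates."""
    month = list(calendar.month_name).index(month_name)
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)

def build_admission_query(month_name, year):
    start, end = month_bounds(month_name, year)
    sql = ("SELECT * FROM patients "
           "WHERE admission_date BETWEEN ? AND ?")   # hypothetical schema
    return sql, (start.isoformat(), end.isoformat())

print(build_admission_query("February", 2011))
# ('SELECT * FROM patients WHERE admission_date BETWEEN ? AND ?', ('2011-02-01', '2011-02-28'))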

5.6.1 Temporal Relation Discovery
The discovery of temporal relations has applications in temporal queries and in constraint and trigger implementation.

The following are some temporal relations: • A column constraint that implements an after relation between events. • A query about a before relation between events. • A query about an equal relationship between intervals. • A database trigger about a meets relationship between an interval and an event or between two intervals.

5.6.2 Semantic Queries on Temporal Data
Traditional temporal data mining focuses heavily on the harvesting of one specific type of temporal information: cause/effect relationships, through the discovery of association rules, classification rules, and so on. However, in this book we take a broader view of temporal data mining, one that encompasses the discovery of structural and semantic relationships, where the latter is done within the context of temporal ontologies. The discovery of structural relationships will be discussed in the next chapter; the discovery of semantic relationships is discussed in this section, because it is very closely intertwined with the representation of time inside the database. Regarding terminology, an ontology is a model of real-world entities and their relationships, while the term semantic relationship denotes a relationship that has a meaning within an ontology. Ontologies are being developed today for a large variety of fields, ranging from the medical field to geographical systems to business process management. As a result, there is a growing need to extract information from database systems that can be classified according to the ontology of the field. It is also desirable that this information extraction is done using natural language processing (NLP).


5.7 Temporal Data Types
Temporal data can be distinguished into three types:
• Time series: They represent ordered, real-valued measurements at regular temporal intervals. A time series X = {x1, x2, …, xn} for t = t1, t2, …, tn is a discrete function with value x1 at time t1, value x2 at time t2, and so on. Time series data may occur at varying frequencies, and the presence of noise is also a common phenomenon. A time series can be multivariate or univariate: a multivariate time series is created by more than one variable, while in a univariate time series there is one underlying variable. Another differentiation of time series is between stationary and non-stationary: a stationary time series has a mean and a variance that do not change over time, while a non-stationary time series has no salient mean and can decrease or increase over time.
• Temporal sequences: These can be time-stamped at regular or irregular time intervals.
• Semantic temporal data: These are defined within the context of an ontology.

Finally, an event can be considered as a special case of a temporal sequence with one time-stamped element. Similarly, a series of events is another way to denote a temporal sequence, where the elements of the sequence are of the same types semantically, such as earthquakes, alarms and so on.

5.8 Temporal Data Processing
The temporal data mining process is explained in detail below.

5.8.1 Data Cleaning
In order for the data mining process to yield meaningful results, the data must be appropriately preprocessed to make it "clean," which can have different meanings depending on the application. Here, we will deal with two aspects of data "cleanliness": handling missing data and noise removal.

Missing data
A problem that quite often complicates time series analysis is missing data. There are several reasons for this, such as malfunctioning equipment, human error, and environmental conditions. The specific handling of missing data depends on the specific application and the amount and type of missing data. There are two approaches to dealing with missing data (a sketch of the second follows this list):
• Not filling in the missing values: For example, in the similarity computation section of this chapter, we discuss a method that computes the similarities of two time series by comparing their local slopes. If a segment of data is missing in one of the two time series, we simply ignore that piece in our similarity computation.
• Filling in the missing value with an estimate: For example, in the case of a time series and for small numbers of contiguous missing values, we can use data interpolation to create an estimate of the missing values, using adjacent values as a form of imputation. The greater the distance between the adjacent values used to estimate the intermediate missing values, the greater is the interpolation error. The allowable interpolation error and, therefore, the interpolation distance vary from application to application. The simplest type of interpolation that gives reasonable results is linear interpolation.
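A short Python sketch of the second approach, filling a small run of contiguous missing values by linear interpolation between the nearest known neighbours; the series and the use of None as the missing-value marker are illustrative.

def fill_missing_linear(series):
    """Replace runs of None by linearly interpolating between the nearest known neighbours."""
    filled = list(series)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            j = i
            while j < len(filled) and filled[j] is None:
                j += 1
            if 0 < i and j < len(filled):           # known values exist on both sides
                left, right = filled[i - 1], filled[j]
                step = (right - left) / (j - i + 1)
                for k in range(i, j):
                    filled[k] = left + step * (k - i + 1)
            i = j
        else:
            i += 1
    return filled

print(fill_missing_linear([2.0, None, None, 8.0]))   # [2.0, 4.0, 6.0, 8.0]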

Noise removal
Noise is defined as random error that occurs in the data mining process. It can be due to several factors, such as faulty measurement equipment and environmental factors. Two methods of dealing with noise in data mining are binning and moving-average smoothing. In binning, the data are divided into buckets or bins of equal size; the data are then smoothed by using either the mean, the median, or the boundaries of each bin.
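A small Python sketch of both noise-removal methods, smoothing by bin means and a simple moving average; the bin size, window length and data values are illustrative.

def smooth_by_bin_means(data, bin_size):
    """Replace each value with the mean of its equal-sized bin."""
    smoothed = []
    for start in range(0, len(data), bin_size):
        bin_values = data[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

def moving_average(data, window):
    """Average each value with its preceding values inside the window."""
    return [sum(data[max(0, i - window + 1):i + 1]) / len(data[max(0, i - window + 1):i + 1])
            for i in range(len(data))]

noisy = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(noisy, 3))   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
print(moving_average(noisy, 3))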

5.8.2 Data Normalisation
In data normalisation, the data are scaled so that they fall within a prespecified range, such as [0, 1]. Normalisation allows data to be transformed to the same "scale" and, therefore, allows direct comparisons among their values. Two common types of normalisation are:

• Min-max normalisation: To do this type of normalisation, we need to know the minimum (xmin) and the maximum (xmax) of the data: x' = (x - xmin) / (xmax - xmin).
• Z-score normalisation: Here, the mean (μ) and the standard deviation (σ) of the data are used to normalise them: x' = (x - μ) / σ.

Z-score normalisation is useful in cases where there are outliers in the data, that is, data points with extremely low or high values that are not representative of the data and could be due to measurement error.

As we can see, both types of normalisation preserve the shape of the original time series, but z-score normalisation follows the shape more closely.
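A brief Python sketch of the two normalisations applied to the same series; the values, including the deliberate outlier, are illustrative.

def min_max_normalise(data):
    lo, hi = min(data), max(data)
    return [(x - lo) / (hi - lo) for x in data]

def z_score_normalise(data):
    mean = sum(data) / len(data)
    std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
    return [(x - mean) / std for x in data]

series = [2.0, 4.0, 6.0, 8.0, 100.0]        # note the outlier at 100
print(min_max_normalise(series))             # most values are squeezed towards 0 by the outlier
print(z_score_normalise(series))             # the outlier stands out as a large z-score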

5.9 Temporal Event Representation
Temporal event representation is explained below.

5.9.1 Event Representation Using Markov Models
In many areas of research, a question that often arises is: what is the probability that a specific event B will happen after an event A? For example, in earthquake research, one would like to determine the probability that there will be a major earthquake after certain types of smaller earthquakes. In speech recognition, one would like to know, for example, the probability that the vowel e will appear after the letter h. Markov models offer a way to answer such questions.

[Figure: a Markov diagram with three states, "Enrolment in electrical engineering", "Enrolment in business", and "Enrolment in mechanical engineering"; the labelled transition probabilities include remaining in electrical engineering (0.6), switching from electrical engineering to business (0.2), and switching from electrical engineering to mechanical engineering (0.2).]

Fig. 5.5 A Markov diagram that describes the probability of program enrolment changes

Here, we will use the fact that a Markov model can be represented using a graph that consists of vertices and arcs. The vertices show states, while the arcs show transitions between the states.

A more specific example is shown in Figure 5.5, which shows the probability of changing majors in college. Thus, we see that the probability of staying enrolled in electrical engineering is 0.6, while the probability of changing major from electrical engineering to business studies is 0.2 and the probability of switching from electrical engineering to mechanical engineering is also 0.2. As we can see, the sum of probabilities leaving the “Electrical eng. enrolment” is 1.
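A small Python sketch of this Markov model as a transition-probability table, used to compute the probability of a particular sequence of enrolment states. Only the electrical-engineering row follows the probabilities quoted above; the other rows are illustrative placeholders, not values given in the text.

# Transition probabilities P(next_state | current_state).
transitions = {
    "electrical": {"electrical": 0.6, "business": 0.2, "mechanical": 0.2},
    "business":   {"business": 0.9, "electrical": 0.1},    # placeholder values
    "mechanical": {"mechanical": 0.8, "business": 0.2},    # placeholder values
}

def path_probability(path):
    """Probability of observing a sequence of states under the Markov model."""
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        prob *= transitions[current].get(nxt, 0.0)
    return prob

print(path_probability(["electrical", "mechanical", "business"]))  # = 0.2 * 0.2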


5.9.2 A Formalism for Temporal Objects and Repetitions
Allen's temporal algebra is the most widely accepted way to characterise relationships between intervals. Some of these relationships are before, meets, starts, during, overlaps, finishes, and equals, and they are written respectively as .b., .m., .s., .d., .o., .f., and .e. Their inverse relationships can be expressed with a power of -1. For example, .e-1 denotes the inverse of the equals relationship, which is not equal.
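A compact Python sketch of several of Allen's interval relations, with each interval given as a (start, end) pair; the implementation is a straightforward reading of the usual definitions of these relations, and the example intervals are illustrative.

def before(a, b):   return a[1] < b[0]                     # a ends before b starts (.b.)
def meets(a, b):    return a[1] == b[0]                    # a ends exactly where b starts (.m.)
def starts(a, b):   return a[0] == b[0] and a[1] < b[1]    # same start, a ends first (.s.)
def during(a, b):   return a[0] > b[0] and a[1] < b[1]     # a lies strictly inside b (.d.)
def overlaps(a, b): return a[0] < b[0] < a[1] < b[1]       # a starts first and they overlap (.o.)
def finishes(a, b): return a[1] == b[1] and a[0] > b[0]    # same end, a starts later (.f.)
def equals(a, b):   return a == b                          # identical intervals (.e.)

surgery, hospital_stay = (3, 5), (1, 10)
print(during(surgery, hospital_stay))    # True
print(before(hospital_stay, (12, 14)))   # True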

5.10 Classification Techniques
In classification, we assume we have some domain knowledge about the problem we are trying to solve. This domain knowledge can come from domain experts or from sample data from the domain, which constitute the training data set. In this section, we will examine classification techniques that were developed for non-temporal data. However, as the examples will demonstrate, they can be easily extended to temporal data.

5.10.1 Distance-Based Classifier
As the name implies, the main idea in this family of classifiers is to classify a new sample to the nearest class. Each sample is represented by N features. There are different implementations of this type of classifier, based on (1) how each class is represented and (2) how the distance is computed from each class. There are two variations on how to represent each class. First, in the K-Nearest Neighbours approach, all samples of a class are used to represent the class. Second, in the exemplar-based approach, each class is represented by a representative sample, commonly referred to as the exemplar of this class. The most widely used metric to compute the distance of the unknown-class sample to the existing classes is the Euclidean distance. However, for some classification techniques other distance measures are more suitable.

As an example, we assume we have a new sample x of unknown class with N features, where the ith feature is denoted as xi. Then the Euclidean distance from a class sample, whose ith feature is denoted as yi, is defined as d(x, y) = sqrt(Σ (xi - yi)²), with the sum taken over i = 1, …, N.

K-Nearest neighbours
In this type of classifier, the domain knowledge of each class is represented by all of its samples. The new sample X, whose class is unknown, is classified to the class that has the K nearest neighbours to it. K can be 1, 2, 3, and so on. Because all the training samples are stored for each class, this is a computationally expensive method.

Several important considerations about the nearest-neighbour algorithm are described below (a sketch of the classifier follows this list):
• The algorithm's performance is affected by the choice of K. If K is small, then the algorithm can be affected by noise points. If K is too large, then the nearest neighbours can belong to many different classes.
• The choice of the distance measure can also affect the performance of the algorithm. Some distance measures are influenced by the dimensionality of the data. For example, the Euclidean distance's classifying power is reduced as the number of attributes increases.
• The error of the K-NN algorithm asymptotically approaches that of the Bayes error.
• K-NN is particularly applicable to classification problems with multimodal classes.
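A minimal Python sketch of the K-nearest-neighbours classifier using the Euclidean distance defined above; the training samples and the value of K are illustrative.

import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(sample, training, k=3):
    """training is a list of (feature_vector, class_label) pairs."""
    nearest = sorted(training, key=lambda t: euclidean(sample, t[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [((1.0, 1.1), "A"), ((1.2, 0.9), "A"), ((1.1, 1.0), "A"),
            ((4.0, 4.2), "B"), ((3.9, 4.1), "B")]
print(knn_classify((1.0, 1.0), training, k=3))   # 'A'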

5.10.2 Bayes Classifier
The Bayes classifier is a statistical classifier, because classification is based on the computation of probabilities, and domain knowledge is also expressed in a probabilistic way.

5.10.3 Decision Tree
Decision trees are widely used in classification because they are easy to construct and use. The first step in decision tree classification is to build the tree. A popular tree construction algorithm is ID3, which uses information theory as its premise. An improvement on the ID3 algorithm that can handle non-categorical data is C4.5. A critical step in the construction of the decision tree is how to order the splitting of the features in the tree. In the ID3 algorithm, the guide for the ordering of the features is entropy. Given a data set where each value has probabilities p1, p2, …, pn, its entropy is defined as H = -Σ pi log2 pi, with the sum taken over i = 1, …, n.

Entropy shows the amount of randomness in a data set and varies from 0 to 1. If there is no uncertainty in the data, then the entropy is 0; for example, this can happen if one value has probability 1 and the others have probability 0. If all values in the data set are equally probable, then the amount of randomness in the data is maximised and the entropy becomes 1. In this case, the amount of information in the data set is maximised. The main idea of the ID3 algorithm is as follows: a feature is chosen as the next level of the tree if splitting on it produces the most information gain.
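A brief Python sketch of the entropy and information-gain computations that guide ID3's choice of splitting feature; the toy data set (outlook and windy features, play/no-play labels) is illustrative.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Reduction in entropy obtained by splitting on the given feature."""
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset) for subset in by_value.values())
    return entropy(labels) - remainder

# Toy data: features = (outlook, windy), label = whether to play
rows = [("sunny", True), ("sunny", True), ("sunny", False),
        ("rain", True), ("rain", False), ("rain", False)]
labels = ["no", "no", "no", "yes", "yes", "yes"]
print(information_gain(rows, labels, 0))   # outlook separates the classes perfectly: gain = 1.0
print(information_gain(rows, labels, 1))   # windy is much less informative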

5.10.4 Neural Networks in Classification
An artificial neural network (ANN) is a computational model that mimics the human brain in the sense that it consists of a set of connected nodes, similar to neurons. The nodes are connected with weighted arcs. The system is adaptive because it can change its structure based on the information that flows through it. In addition to nodes and arcs, neural networks consist of an input layer, hidden layers, and an output layer. The simplest form of ANN is the feed-forward ANN, where information flows in one direction and the output of each neuron is calculated by summing the weighted signals from incoming neurons and passing the sum through an activation function. A commonly used activation function is the sigmoid function f(x) = 1 / (1 + e^(-x)).

A common training process for feed forward neural networks is the back-propagation process, where we go back in the layers and modify the weights. The weight of each neuron is adjusted such that its error is reduced, where a neuron’s error is the difference between its expected and actual outputs. The most well-known feedforward ANN is the perceptron, which consists of only two layers (no hidden layers) and works as a binary classifier. If three or more layers exist in the ANN (at least one hidden layer) then the perceptron is known as the multilayer perceptron.

Another type of widely used feedforward ANN is the radial-basis function ANN, which consists of three layers and whose activation function is a radial-basis function (RBF). This type of function, as the name implies, has radial symmetry, such as a Gaussian function, and allows a neuron to respond to a local region of the feature space. In other words, the activation of a neuron depends on its distance from a centre vector. In the training phase, the RBF centres are chosen to match the training samples. Neural network classification is becoming very popular, and one of its advantages is that it is resistant to noise. The input layer consists of the attributes used in the classification, and the output nodes correspond to the classes. Regarding hidden nodes, too many nodes lead to over-fitting, while too few nodes can lead to reduced classification accuracy. Initially, each arc is assigned a random weight, which can then be modified in the learning process.
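A minimal Python sketch of the feed-forward computation described above: each neuron sums its weighted inputs and passes the result through the sigmoid activation. The weights and biases shown are arbitrary illustrative values, not trained ones.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    """Weighted sum of incoming signals passed through the sigmoid."""
    return sigmoid(sum(i * w for i, w in zip(inputs, weights)) + bias)

def forward(inputs, hidden_layer, output_layer):
    """One forward pass through a single hidden layer and an output layer.
    Each layer is a list of (weights, bias) pairs, one per neuron."""
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    return [neuron(hidden, w, b) for w, b in output_layer]

hidden_layer = [([0.5, -0.3], 0.1), ([0.8, 0.2], -0.4)]   # two hidden neurons
output_layer = [([1.2, -0.7], 0.05)]                      # one output neuron (class score)
print(forward([0.9, 0.4], hidden_layer, output_layer))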

5.11 Sequence Mining
Sequence mining is explained in the following points.

5.11.1 Apriori Algorithm and Its Extension to Sequence Mining
A sequence is a time-ordered list of objects, in which each object consists of an item set, with an item set consisting of all items that appear (or were bought) together in a transaction (or session). Note that the order of items in an item set does not matter. A sequence database consists of tuples, where each tuple consists of a customer id, a transaction id, and an item set. The purpose of sequence mining is to find frequent sequences that exceed a user-specified support threshold. Note that the support of a sequence is the percentage of tuples in the database that contain the sequence. An example of a sequence is {(ABD), (CDA)}, where ABD are the items that were purchased in the first transaction of a customer and CDA are the items that were purchased in a later transaction of the same customer.

The Apriori algorithm is the most extensively used algorithm for the discovery of frequent item sets and association rules. The main concepts of the Apriori algorithm are as follows:
• Any subset of a frequent item set is a frequent item set.

• The set of item sets of size k will be called Ck.

• The set of frequent item sets that also satisfy the minimum support constraint is known as Lk. This is the seed set used for the next pass over the data.


• Ck+1 is generated by joining Lk with itself. The item sets of each pass have one more element than the item sets of the previous pass.

• Similarly, Lk+1 is then generated by eliminating from Ck+1 those elements that do not satisfy the minimum support rule. As the candidate sequences are generated by starting with the smaller sequences and progressively increasing the sequence size, Apriori is called a breadth-first approach. An example is sketched below, where we require the minimum support to be two transactions.
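The worked transaction table from the original example is not reproduced here; the following is a Python sketch of the Apriori level-wise search over an illustrative list of transactions, with a minimum support of two transactions.

from itertools import combinations

def apriori(transactions, min_support=2):
    """Return all frequent item sets (as frozensets) with their support counts."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    Lk = {s for s in items if support(s) >= min_support}        # L1
    k = 1
    while Lk:
        frequent.update({s: support(s) for s in Lk})
        # Ck+1: join Lk with itself, keeping candidates of size k + 1 whose
        # k-subsets are all frequent (any subset of a frequent set is frequent)
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(sub) in Lk for sub in combinations(c, k))}
        Lk = {c for c in candidates if support(c) >= min_support}
        k += 1
    return frequent

transactions = [frozenset("ABD"), frozenset("BCD"), frozenset("ABCD"), frozenset("BD")]
for itemset, count in sorted(apriori(transactions).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), count)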

5.11.2 The GSP Algorithm
GSP was proposed by the same team as the Apriori algorithm, and it can be up to 20 times faster than the Apriori algorithm because it makes a more intelligent selection of candidates at each step and introduces time constraints into the search space. GSP is formulated to address three cases:
• The presence of time constraints that specify a maximum time between adjacent elements in a pattern. The time window can apply to items that are bought in the same transaction or in different transactions.
• The relaxation of the restriction that the items in an element of a sequential pattern must come from the same transaction.
• Allowing elements in a sequence to consist of items that belong to a taxonomy.

The algorithm has very good scalability as the size of the data increases. The GSP algorithm is similar to the Apriori algorithm. It makes multiple passes over the data as shown below: • In the first pass, it finds the frequent sequences. In other words, it finds the sequences that have minimum support. This becomes the next seed set for the next iteration. • At each next iteration, each candidate sequence has one more item than the seed sequence.

Two key innovations in the GSP algorithm are how candidates are generated and how candidates are counted.

Candidate Generation

The main idea here is the definition of a contiguous subsequence. Assume we are given a sequence S = {S1, S2, …, SN}; then a subsequence C is defined as a contiguous subsequence if any of the following constraints is satisfied:
• C is extracted from S by dropping an item from either the beginning or the end of the sequence (S1 or SN).
• C is extracted from S by dropping an item from a sequence element that has at least 2 items.
• C is a contiguous subsequence of C′, and C′ is a contiguous subsequence of S.

Summary
• The technical progress in computerised data acquisition and storage results in the growth of vast databases. With the continuous increase and accumulation, the huge amounts of computerised data have far exceeded human ability to completely interpret and use them.
• Spatial data mining and knowledge discovery (SDMKD) is the efficient extraction of hidden, implicit, interesting, previously unknown, potentially useful, ultimately understandable, spatial or non-spatial knowledge (rules, regularities, patterns, constraints) from incomplete, noisy, fuzzy, random and practical data in large spatial databases.
• The cloud model is a model of the uncertainty transition between qualitative and quantitative analysis, that is, a mathematical model of the uncertainty transition between a linguistic term of a qualitative concept and its numerical representation data.
• The design-based approach or classical sampling theory approach to spatial sampling views the population of values in the region as a set of unknown values which are, apart from any measurement error, fixed in value.
• The model-based approach or superpopulation approach to spatial sampling views the population of values in the study region as but one realisation of some stochastic model.
• A database model depicts the way that the database management system stores the data and manages their relations.
• A data warehouse (DW) is a repository of data that can be used in support of business decisions. Many data warehouses have a time dimension and therefore support the idea of valid time.
• Temporal constraints can be either qualitative or quantitative.
• A temporal database mediator is used to discover temporal relations, implement temporal granularity conversion, and also discover semantic relationships.
• In classification, we assume we have some domain knowledge about the problem we are trying to solve.
• A sequence is a time-ordered list of objects, in which each object consists of an item set, with an item set consisting of all items that appear together in a transaction.

References
• Han, J., Kamber, M. and Pei, J., 2011. Data Mining: Concepts and Techniques, 3rd ed., Elsevier.
• Mitsa, T., 2009. Temporal Data Mining, Chapman & Hall/CRC.
• Dr. Krie, H. P., Spatial Data Mining [Online] Available at: . [Accessed 9 September 2011].
• Lin, W., Orgun, M. A. and Williams, G. J., An Overview of Temporal Data Mining [Online PDF] Available at: . [Accessed 9 September 2011].
• University of Magdeburg, 2007. 3D Spatial Data Mining on Document Sets [Video Online] Available at: <http://www.youtube.com/watch?v=jJWl4Jm-yqI>. [Accessed 12 September 2011].
• Berlingerio, M., 2009. Temporal mining for interactive workflow data analysis [Video Online] Available at: . [Accessed 12 September].

Recommended Reading • Pujari, A. K., 2001. Data mining techniques, 4th ed., Universities Press. • Stein, A., Shi, W. and Bijker, W., 2008. Quality aspects in spatial data mining, CRC Press. • Roddick, J. F. and Hornsby, K., 2001. Temporal, spatial, and spatio-temporal data mining, Springer.


Self Assessment
1. Which of the following statements is true?
a. The technical progress in computerised data acquisition and storage results in the growth of vast web mining.
b. In 1998, knowledge discovery in databases was further proposed.
c. Metadata are more complex, more changeable and bigger than common affair datasets.
d. Spatial data includes not only positional data and attribute data, but also spatial relationships among spatial entities.

2. Besides tabular data, there are vector and raster graphic data in ______.
a. metadata
b. database
c. spatial database
d. knowledge discovery

3. Which of the following is a process of discovering a form of rules plus exceptions at hierarchical view-angles with various thresholds?
a. SDMKD
b. DMKD
c. KDD
d. EM

4. The ______ is more generalised, condensed and understandable than data.
a. knowledge
b. database
c. mining
d. extraction

5. Match the columns.
1. Probability theory
2. Spatial statistics
3. Evidence theory
4. Fuzzy sets
A. Mine spatial data with randomness on the basis of stochastic probabilities.
B. Clustering analysis is a branch.
C. Mine spatial data via belief function and possibility function.
D. Mine spatial data with fuzziness on the basis of a fuzzy membership function that depicts an inaccurate probability.
a. 1-A, 2-B, 3-C, 4-D
b. 1-D, 2-A, 3-C, 4-B
c. 1-B, 2-A, 3-C, 4-D
d. 1-C, 2-B, 3-A, 4-D

5. Design-based approach is also called ______.
a. classical sampling theory approach
b. model based approach
c. DMKD
d. SDMKD

6. Which of the following is a model of the uncertainty transition between qualitative and quantitative analysis?
a. Dimensional modelling
b. Cloud model
c. Web mining
d. SDMKD

7. Superpopulation approach is also known as ______.
a. classical sampling theory approach
b. model based approach
c. DMKD
d. SDMKD

8. A database system consists of three layers: physical, logical, and ______.
a. internal
b. inner
c. central
d. external

9. Match the columns.
1. Snapshot database
2. Rollback database
3. Historical database
4. Temporal database
A. They support only valid time.
B. They support both valid and transaction times.
C. They keep the most recent version of the data.
D. They support only the concept of transaction time.
a. 1-C, 2-D, 3-A, 4-B
b. 1-A, 2-B, 3-C, 4-D
c. 1-D, 2-C, 3-B, 4-A
d. 1-B, 2-A, 3-D, 4-C


Chapter VI Application and Trends of Data Mining

Aim

The aim of this chapter is to:

• introduce applications of data mining

• analyse spatial data mining

• explore the multimedia data mining

Objectives

The objectives of this chapter are to:

• explicate text mining

• describe query processing techniques

• elucidate trends in data mining

Learning outcome

At the end of this chapter, you will be able to:

• comprehend system products and research prototypes

• enlist additional themes on data mining

• understand the use of data mining in different sectors such as education, telecommunication, finance and so on

6.1 Introduction
Data mining is the process of extracting interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data. It is the set of activities used to find new, hidden or unexpected patterns in data, or unusual patterns in data. Using the information contained within a data warehouse, data mining can often provide answers to questions about an organisation that a decision maker has previously not thought to ask.
• Which products should be promoted to a particular customer?
• What is the probability that a certain customer will respond to a planned promotion?
• Which securities will be most profitable to buy or sell during the next trading session?
• What is the likelihood that a certain customer will default or pay back on schedule?
• What is the appropriate medical diagnosis for this patient?

These types of questions can be answered easily if the information hidden among the petabytes of data in your databases can be located and utilised. In the following paragraphs, we will discuss the applications of and trends in data mining.

6.2 Applications of Data Mining
An important feature of object-relational and object-oriented databases is their capability of storing, accessing and modelling complex structure-valued data, such as set- and list-valued data and data with nested structures. A set-valued attribute may be of homogeneous or heterogeneous type. Typically, set-valued data can be generalised by:
• Generalisation of each value in the set to its corresponding higher-level concept
• Derivation of the general behaviour of the set, such as the number of elements in the set, the types or value ranges in the set, the weighted average for numerical data, or the major clusters formed by the set.

Example: Generalisation of a set-valued attribute
Suppose that the expertise of a person is a set-valued attribute containing the set of values {tennis, hockey, NFS, violin, prince of Persia}. This set can be generalised to a set of high-level concepts, such as {sports, music, computer games}, or to the number 5 (that is, the number of activities in the set). Moreover, a count can be associated with each generalised value to indicate how many elements are generalised to that value, as in {sports (2), music (1), computer games (2)}, where sports (2) indicates two kinds of sports, and so on.
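A small Python sketch of the two generalisation options for a set-valued attribute: mapping each value up a concept hierarchy (with counts attached) and deriving the set's cardinality. The concept hierarchy itself is an illustrative assumption.

from collections import Counter

# Illustrative concept hierarchy (an assumption, not given in the text)
concept_of = {
    "tennis": "sports", "hockey": "sports",
    "violin": "music",
    "NFS": "computer games", "prince of Persia": "computer games",
}

def generalise_set(values):
    """Return the higher-level concepts (with counts) and the set's cardinality."""
    counts = Counter(concept_of.get(v, "other") for v in values)
    higher_level = {f"{concept} ({n})" for concept, n in counts.items()}
    return higher_level, len(values)

expertise = {"tennis", "hockey", "NFS", "violin", "prince of Persia"}
print(generalise_set(expertise))
# e.g. ({'sports (2)', 'music (1)', 'computer games (2)'}, 5) -- set print order may vary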

6.2.1 Aggregation and Approximation in Spatial and Multimedia Data Generalisation
Aggregation and approximation are another important means of generalisation. They are especially useful for generalising attributes with large sets of values, complex structures, and spatial or multimedia data.

Example: Spatial aggregation and approximation
Suppose that we have different pieces of land for several purposes of agricultural usage, such as the planting of vegetables, grains, and fruits. These pieces can be merged or aggregated into one large piece of agricultural land by a spatial merge. However, such a piece of agricultural land may contain highways, houses, and small stores. If the majority of the land is used for agriculture, the scattered regions for other purposes can be ignored, and the whole region can be claimed as an agricultural area by approximation.

6.2.2 Generalisation of Object Identifiers and Class/Subclass Hierarchies
An object identifier can be generalised as follows:
• First, the object identifier is generalised to the identifier of the lowest subclass to which the object belongs.
• The identifier of this subclass can then, in turn, be generalised to a higher-level class/subclass identifier by climbing up the class/subclass hierarchy.
• Similarly, a class or a subclass can be generalised to its corresponding superclass(es) by climbing up its associated class/subclass hierarchy.


6.2.3 Generalisation of Class Composition Hierarchies An attribute of an object may be composed of or described by another object, some of whose attributes may be in turn composed of or described by other objects, thus forming a class composition hierarchy. Generalisation on a class composition hierarchy can be viewed as generalisation on a set of nested structured data (which are possibly infinite, if the nesting is recursive).

6.2.4 Construction and Mining of Object Cubes In an object database, data generalisation and multidimensional analysis are not applied to individual objects, but to classes of objects. Since a set of objects in a class may share many attributes and methods, and the generalisation of each attribute and method may apply a sequence of generalisation operators, the major issue becomes how to make the generalisation processes cooperate among different attributes and methods in the class(es).

6.2.5 Generalisation-Based Mining of Plan Databases by Divide-and-Conquer A plan consists of a variable sequence of actions. A plan database, or simply a planbase, is a large collection of plans. Plan mining is the task of mining significant patterns or knowledge from a planbase.

6.3 Spatial Data Mining A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or medical imaging data, and VLSI chip layout data. Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases.

[Figure: spatial mining combines the ODM engine with a spatial engine; original data and spatial thematic data layers are materialised (spatial binning, proximity, collocation materialisation) and passed, together with spatial mining functions, to the ODM engine, which produces the mining results.]

Fig. 6.1 Spatial mining (Source: http://dataminingtools.net/wiki/applications_of_data_mining.php)

6.3.1 Spatial Data Cube Construction and Spatial OLAP
As with relational data, we can integrate spatial data to construct a data warehouse that facilitates spatial data mining. A spatial data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of both spatial and non-spatial data in support of spatial data mining and spatial-data-related decision-making processes.

There are three types of dimensions in a spatial data cube: • A non-spatial dimension • A spatial-to-non-spatial dimension • A spatial-to-spatial dimension

We can distinguish two types of measures in a spatial data cube: • A numerical measure contains only numerical data • A spatial measure contains a collection of pointers to spatial objects

6.3.2 Mining Spatial Association and Co-location Patterns
For mining spatial associations related to the spatial predicate close to, we can first collect the candidates that pass the minimum support threshold by:
• Applying certain rough spatial evaluation algorithms, for example, using an MBR structure (which registers only two spatial points rather than a set of complex polygons), as sketched below
• Evaluating the relaxed spatial predicate, g_close_to, which is a generalised close to covering a broader context that includes close to, touch, and intersect.
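A rough MBR-based filter of the kind mentioned in the first bullet can be sketched as follows. This is only an illustrative approximation: each object is reduced to its minimum bounding rectangle (two corner points), and candidate pairs are kept when the rectangles are within a chosen distance threshold; exact spatial predicates would then be evaluated only on the surviving candidates. The object names and coordinates are invented.

```python
import math

def mbr_distance(a, b):
    """Minimum distance between two MBRs given as (xmin, ymin, xmax, ymax)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    dx = max(bx1 - ax2, ax1 - bx2, 0.0)   # horizontal gap (0 if the rectangles overlap in x)
    dy = max(by1 - ay2, ay1 - by2, 0.0)   # vertical gap (0 if the rectangles overlap in y)
    return math.hypot(dx, dy)

def close_to_candidates(objects, threshold):
    """Cheap filtering step: keep object pairs whose MBRs lie within 'threshold'."""
    names = list(objects)
    pairs = []
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            if mbr_distance(objects[p], objects[q]) <= threshold:
                pairs.append((p, q))
    return pairs

# Hypothetical spatial objects described only by their MBRs.
objects = {"park": (0, 0, 2, 2), "school": (3, 1, 4, 2), "factory": (10, 10, 12, 12)}
print(close_to_candidates(objects, threshold=2.0))   # [('park', 'school')]
```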

Spatial clustering methods: Spatial data clustering identifies clusters or densely populated regions, according to some distance measurement in a large, multidimensional data set.

Spatial classification and spatial trend analysis: Spatial classification analyses spatial objects to derive classification schemes in relevance to certain spatial properties, such as the neighbourhood of a district, highway, or river.

Example: Spatial classification Suppose that you would like to classify regions in a province into rich versus poor according to the average family income. In doing so, you would like to identify the important spatial-related factors that determine a region’s classification. Many properties are associated with spatial objects, such as hosting a university, containing interstate highways, being near a lake or ocean, and so on. These properties can be used for relevance analysis and to find interesting classification schemes. Such classification schemes may be represented in the form of decision trees or rules.

6.3.3 Mining Raster Databases Spatial database systems usually handle vector data that consist of points, lines, polygons (regions), and their compositions, such as networks or partitions. Typical examples of such data include maps, design graphs, and 3-D representations of the arrangement of the chains of protein molecules.

6.4 Multimedia Data Mining
A multimedia database system stores and manages a large collection of multimedia data, such as audio, video, image, graphics, speech, text, document, and hypertext data, which contain text, text markups, and linkages. When searching for similarities in multimedia data, we can search on either the data description or the data content. Approaches to similarity search in multimedia data include:


• Colour histogram–based signature • Multifeature composed signature • Wavelet-based signature • Wavelet-based signature with region-based granularity

6.4.1 Multidimensional Analysis of Multimedia Data To facilitate the multidimensional analysis of large multimedia databases, multimedia data cubes can be designed and constructed in a manner similar to that for traditional data cubes from relational data.

A multimedia data cube can contain additional dimensions and measures for multimedia information, such as colour, texture, and shape.

6.4.2 Classification and Prediction Analysis of Multimedia Data Classification and predictive modelling can be used for mining multimedia data, especially in scientific research, such as astronomy, seismology, and geoscientific research.

Example: Classification and prediction analysis of astronomy data Taking sky images that have been carefully classified by astronomers as the training set, we can construct models for the recognition of galaxies, stars, and other stellar objects, based on properties like magnitudes, areas, intensity, image moments, and orientation. A large number of sky images taken by telescopes or space probes can then be tested against the constructed models in order to identify new celestial bodies. Similar studies have successfully been performed to identify volcanoes on Venus.

6.4.3 Mining Associations in Multimedia Data • Associations between image content and non-image content features • Associations among image contents that are not related to spatial relationships • Associations among image contents related to spatial relationships

6.4.4 Audio and Video Data Mining
An immense amount of audiovisual information is now available in digital form, in digital archives, on the World Wide Web, in broadcast data streams, and in personal and professional databases, and hence there is a need to mine it.

6.5 Text Mining
Text data analysis and information retrieval (IR) form a field that has been developing in parallel with database systems for many years. The basic measures for text retrieval are precision and recall.

[Figure: the text mining workflow collects data, parses it into a repository, applies text mining algorithms, optimises, and presents the results for viewing, with unrestricted exploratory freedom for the analyst.]

Fig. 6.2 Text mining (Source: http://dataminingtools.net/wiki/applications_of_data_mining.php)

Precision: This is the percentage of retrieved documents that are in fact relevant to the query (that is, "correct" responses). Recall: This is the percentage of documents that are relevant to the query and were, in fact, retrieved. Formally, if {Relevant} is the set of documents relevant to a query and {Retrieved} is the set of documents retrieved for it, then precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}| and recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|.
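As a rough illustration (not part of the original text), the sketch below computes both measures for a toy query result; the document identifiers are invented purely for the example.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query from sets of document ids."""
    hits = len(retrieved & relevant)                       # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}    # documents the system returned
relevant = {"d2", "d4", "d7"}           # documents actually relevant to the query
print(precision_recall(retrieved, relevant))   # (0.5, 0.666...)
```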

Text Retrieval Methods
• Document selection methods
• Document ranking methods

Text indexing techniques • Inverted indices • Signature files.

6.6 Query Processing Techniques
Once an inverted index is created for a document collection, a retrieval system can answer a keyword query quickly by looking up which documents contain the query keywords.
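The idea in this paragraph can be illustrated in a few lines of Python: build an inverted index mapping each keyword to the set of documents containing it, then answer a multi-keyword query by intersecting the posting sets. The toy documents below are invented for the example; a real system would also handle stemming, stop words and ranking.

```python
from collections import defaultdict

docs = {
    1: "data mining discovers patterns in data",
    2: "spatial data mining works on maps",
    3: "text retrieval uses an inverted index",
}

# Build the inverted index: keyword -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def keyword_query(*keywords):
    """Return the ids of documents containing all the query keywords."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*postings) if postings else set()

print(keyword_query("data", "mining"))   # {1, 2}
print(keyword_query("inverted"))         # {3}
```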

6.6.1 Ways of dimensionality Reduction for Text • Latent Semantic Indexing • Locality Preserving Indexing • Probabilistic Latent Semantic Indexing

6.6.2 Text Mining Approaches
• Keyword-Based Association Analysis
• Document Classification Analysis
• Document Clustering Analysis


6.6.3 Mining the World Wide Web The World Wide Web serves as a huge, widely distributed, global information service center for news, advertisements, consumer information, financial management, education, government, e-commerce, and many other information services. The Web also contains a rich and dynamic collection of hyperlink information and Web page access and usage information, providing rich sources for data mining.

6.6.4 Challenges
• The Web seems to be too huge for effective data warehousing and data mining
• The complexity of Web pages is far greater than that of any traditional text document collection
• The Web is a highly dynamic information source
• The Web serves a broad diversity of user communities
• Only a small portion of the information on the Web is truly relevant or useful

Authoritative Web pages: Suppose you would like to search for Web pages relating to a given topic, such as financial investing. In addition to retrieving pages that are relevant, you also hope that the pages retrieved will be of high quality, or authoritative on the topic.

6.7 Data Mining for Healthcare Industry
The past decade has seen an explosive growth in biomedical research, ranging from the development of new pharmaceuticals and advances in cancer therapies to the identification and study of the human genome through the discovery of large-scale sequencing patterns and gene functions. Recent research in DNA analysis has led to the discovery of genetic causes for many diseases and disabilities, as well as approaches for disease diagnosis, prevention and treatment.

6.8 Data Mining for Finance
Most banks and financial institutions offer a wide variety of banking services (such as checking, savings, and business and individual customer transactions), credit (such as business, mortgage, and automobile loans), and investment services (such as mutual funds). Some also offer insurance and stock services. Financial data collected in the banking and financial industry is often relatively complete, reliable and of high quality, which facilitates systematic data analysis and data mining. Data mining can also help in fraud detection, for example, by detecting groups of people who stage accidents to collect on insurance money.

6.9 Data Mining for Retail Industry
The retail industry collects huge amounts of data on sales, customer shopping history, goods transportation, consumption and service records, and so on. The quantity of data collected continues to expand rapidly, especially due to the increasing ease, availability and popularity of business conducted on the web, or e-commerce. The retail industry therefore provides a rich source for data mining. Retail data mining can help identify customer behaviour, discover customer shopping patterns and trends, enhance the quality of customer service, achieve better customer retention and satisfaction, improve goods consumption ratios, design more effective goods transportation and distribution policies, and reduce the cost of business.

6.10 Data Mining for Telecommunication
The telecommunication industry has rapidly evolved from offering local and long-distance telephone services to offering many other comprehensive communication services, including voice, fax, pager, cellular phone, images, e-mail, computer and web data transmission, and other data traffic. The integration of telecommunication, computer networks, the Internet and numerous other means of communication and computing is underway. Moreover, with the deregulation of the telecommunication industry in many countries and the development of new computer and communication technologies, the telecommunication market is rapidly expanding and highly competitive. This creates a great demand for data mining to help understand the business involved, identify telecommunication patterns, catch fraudulent activities, make better use of resources, and enhance the quality of service.

6.11 Data Mining for Higher Education
An important challenge that higher education faces today is predicting the paths of students and alumni. Which students will enrol in particular course programs? Who will need additional assistance in order to graduate? Meanwhile, additional issues, such as enrolment management and time-to-degree, continue to exert pressure on colleges to search for new and faster solutions. Institutions can better address these students and alumni through the analysis and presentation of data. Data mining has quickly emerged as a highly desirable tool for using current reporting capabilities to uncover and understand hidden patterns in vast databases.

6.12 Trends in Data Mining
The diversity of available data types and data mining tasks poses many challenging research issues in data mining. The design of standard data mining languages, the development of effective and efficient data mining methods and systems, the construction of interactive and integrated data mining environments, and the application of data mining to solve large application problems are important tasks for data mining researchers and for data mining system and application developers. Here, we will discuss some of the trends in data mining that reflect the pursuit of these challenges.

6.12.1 Application Exploration
Earlier, data mining was used mainly by businesses to gain an edge over competitors. As data mining becomes more popular, it is gaining wide acceptance in other fields too, such as biomedicine, stock market analysis, fraud detection and telecommunication, and many new application areas are being explored. In addition, data mining for business continues to expand as e-commerce and marketing become mainstream elements of the retail industry.

6.12.2 Scalable Data Mining Methods
Current data mining methods are capable of handling only specific types of data and limited amounts of data. As data expands at a massive rate, there is a need to develop new data mining methods that are scalable and can handle different types of data and large volumes of data.

The data mining methods should also be more interactive and user-friendly. One important direction towards improving the overall efficiency of the mining process while increasing user interaction is constraint-based mining. This provides users with more control by allowing the specification and use of constraints to guide data mining systems in their search for interesting patterns.

6.12.3 Combination of Data Mining with Database Systems, Data Warehouse Systems, and Web Database Systems
Database systems, data warehouse systems, and the WWW are loaded with huge amounts of data and have thus become the major information processing systems. It is important to ensure that data mining serves as an essential data analysis component that can be easily integrated into such an information-processing environment. The desired architecture for a data mining system is tight coupling with database and data warehouse systems. Transaction management, query processing, online analytical processing and online analytical mining should be integrated into one unified framework.

6.12.4 Standardisation of Data Mining Language
Today, a few data mining products are commercially available in the market, such as Microsoft SQL Server 2005, IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine and DBMiner, but a standard data mining language or other standardisation efforts would promote the orderly development of data mining solutions and improved interoperability among multiple data mining systems and functions.


6.12.5 Visual Data Mining
It is rightly said that a picture is worth a thousand words. If the results of mining can be shown in visual form, their worth is further enhanced. Visual data mining is an effective way to discover knowledge from huge amounts of data. The systematic study and development of visual data mining techniques will promote their use in data mining analysis.

6.12.6 New Methods for Mining Complex Types of Data
Complex types of data, such as geospatial, multimedia, time-series, sequence and text data, pose important research areas in the field of data mining. There is still a huge gap between the needs of these applications and the available technology.

6.12.7 Web Mining
The World Wide Web is a huge, globally distributed collection of news, advertisements, consumer records, financial, educational, governmental, e-commerce and many other services. The WWW also contains a vast and dynamic collection of hyperlinked information, providing a rich source for data mining. At the same time, the Web poses great challenges for efficient resource and knowledge discovery.

Since data mining is a young discipline with wide and diverse applications, there is still a nontrivial gap between general principles of data mining and domain-specific, effective data mining tools for particular applications. We have discussed a few application domains of data mining (such as finance, the retail industry and telecommunication) and trends in data mining, which include further efforts towards the exploration of new application areas and new methods for handling complex data types, algorithm scalability, constraint-based mining and visualisation methods, the integration of data mining with data warehousing and database systems, the standardisation of data mining languages, and data privacy protection and security.

6.13 System Products and Research Prototypes Although data mining is a relatively young field with many issues that still need to be researched in depth, many off-the-shelf data mining system products and domain specific data mining application software are available. As a discipline, data mining has a relatively short history and is constantly evolving—new data mining systems appear on the market every year; new functions, features, and visualisation tools are added to existing systems on a constant basis; and efforts toward the standardisation of data mining language are still underway. Therefore, it is not our intention in this book to provide a detailed description of commercial data mining systems. Instead, we describe the features to consider when selecting a data mining product and offer a quick introduction to a few typical data mining systems. Reference articles, websites, and recent surveys of data mining systems are listed in the bibliographic notes.

6.13.1 Choosing a Data Mining System
With many data mining system products available in the market, you may ask, "What kind of system should I choose?" Some people may be under the impression that data mining systems, like many commercial relational database systems, share the same well-defined operations and a standard query language, and behave similarly on common functionalities. If such were the case, the choice would depend more on the systems' hardware platform, compatibility, robustness, scalability, price, and service. Unfortunately, this is far from reality. Many commercial data mining systems have little in common with respect to data mining functionality or methodology and may even work with completely different kinds of data sets. To choose a data mining system that is appropriate for your task, it is essential to have a multidimensional view of data mining systems. In general, data mining systems should be assessed based on the following multiple features:
• Data types: Most data mining systems that are available in the market handle formatted, record-based, relational-like data with numerical, categorical, and symbolic attributes. The data could be in the form of ASCII text, relational database data, or data warehouse data. It is important to check what exact format(s) each system you are considering can handle. Some kinds of data or applications may require specialised algorithms to search for patterns, and so their requirements may not be handled by off-the-shelf, generic data mining systems. Instead, specialised data mining systems may be used, which mine either text documents, geospatial data, multimedia data, stream data, time-series data, biological data, or Web data, or are dedicated to specific applications (such as finance, the retail industry, or telecommunications). Moreover, many data mining companies offer customised data mining solutions that incorporate essential data mining functions or methodologies.
• System issues: A given data mining system may run on only one operating system or on several. The most popular operating systems that host data mining software are UNIX/Linux and Microsoft Windows. There are also data mining systems that run on Macintosh, OS/2, and others. Large industry-oriented data mining systems often adopt a client/server architecture, where the client could be a personal computer, and the server could be a set of powerful parallel computers. A recent trend has data mining systems providing Web-based interfaces and allowing XML data as input and/or output.
• Data sources: This refers to the specific data formats on which the data mining system will operate. Some systems work only on ASCII text files, whereas many others work on relational data or data warehouse data, accessing multiple relational data sources. It is essential that a data mining system supports ODBC connections or OLE DB for ODBC connections. These ensure open database connectivity, that is, the ability to access any relational data (including those in IBM DB2, Microsoft SQL Server, Microsoft Access, Oracle, Sybase, and so on), as well as formatted ASCII text data.
• Data mining functions and methodologies: Data mining functions form the core of a data mining system. Some data mining systems provide only one data mining function, such as classification. Others may support multiple data mining functions, such as concept description, discovery-driven OLAP analysis, association mining, linkage analysis, statistical analysis, classification, prediction, clustering, outlier analysis, similarity search, sequential pattern analysis, and visual data mining. For a given data mining function (such as classification), some systems may support only one method, whereas others may support a wide variety of methods (such as decision tree analysis, Bayesian networks, neural networks, support vector machines, rule-based classification, k-nearest-neighbour methods, genetic algorithms, and case-based reasoning). Data mining systems that support multiple data mining functions and multiple methods per function provide the user with greater flexibility and analysis power. Many problems may require users to try a few different mining functions or incorporate several together, and different methods can be more effective than others for different kinds of data. In order to take advantage of the added flexibility, however, users may require further training and experience. Thus, such systems should also provide novice users with convenient access to the most popular function and method, or to default settings.
• Coupling data mining with database and/or data warehouse systems: A data mining system should be coupled with a database and/or data warehouse system, where the coupled components are seamlessly integrated into a uniform information processing environment. In general, there are four forms of such coupling: no coupling, loose coupling, semi-tight coupling, and tight coupling. Some data mining systems work only with ASCII data files and are not coupled with database or data warehouse systems at all.
Such systems have difficulties using the data stored in database systems and handling large data sets efficiently. In data mining systems that are loosely coupled with database and data warehouse systems, the data are retrieved into a buffer or main memory by database or warehouse operations, and then mining functions are applied to analyse the retrieved data. These systems may not be equipped with scalable algorithms to handle large data sets when processing data mining queries. The coupling of a data mining system with a database or data warehouse system may be semi-tight, providing the efficient implementation of a few essential data mining primitives (such as sorting, indexing, aggregation, histogram analysis, multiway join, and the precomputation of some statistical measures). Ideally, a data mining system should be tightly coupled with a database system in the sense that the data mining and data retrieval processes are integrated by optimising data mining queries deep into the iterative mining and retrieval process. Tight coupling of data mining with OLAP-based data warehouse systems is also desirable so that data mining and OLAP operations can be integrated to provide OLAP-mining features.
• Scalability: Data mining has two kinds of scalability issues: row (or database size) scalability and column (or dimension) scalability. A data mining system is considered row scalable if, when the number of rows is enlarged 10 times, it takes no more than 10 times as long to execute the same data mining queries. A data mining system is considered column scalable if the mining query execution time increases linearly with the number of columns (or attributes or dimensions). Due to the curse of dimensionality, it is much more challenging to make a system column scalable than row scalable.


• Visualisation tools: "A picture is worth a thousand words"; this is very true in data mining. Visualisation in data mining can be categorised into data visualisation, mining result visualisation, mining process visualisation, and visual data mining. The variety, quality, and flexibility of visualisation tools may strongly influence the usability, interpretability, and attractiveness of a data mining system.
• Data mining query language and graphical user interface: Data mining is an exploratory process. An easy-to-use and high-quality graphical user interface is necessary in order to promote user-guided, highly interactive data mining. Most data mining systems provide user-friendly interfaces for mining. However, unlike relational database systems, where most graphical user interfaces are constructed on top of SQL (which serves as a standard, well-designed database query language), most data mining systems do not share any underlying data mining query language. Lack of a standard data mining language makes it difficult to standardise data mining products and to ensure the interoperability of data mining systems. Recent efforts at defining and standardising data mining query languages include Microsoft's OLE DB for Data Mining.

6.14 Additional Themes on Data Mining
Due to the broad scope of data mining and the large variety of data mining methodologies, not all the themes on data mining can be covered thoroughly. However, some of the important themes on data mining are mentioned below.

6.14.1 Theoretical Foundations of Data Mining Research on the theoretical foundations of data mining has yet to mature. A solid and systematic theoretical foundation is important because it can help to provide a coherent framework for the development, evaluation, and practice of data mining technology. Several theories for the basis of data mining include the following: • Data reduction: In this theory, the basis of data mining is to reduce the data representation. Data reduction trades accuracy for speed in response to the need to obtain quick approximate answers to queries on very large databases. Data reduction techniques include singular value decomposition (the driving element behind principal components analysis), wavelets, regression, log-linear models, histograms, clustering, sampling, and the construction of index trees. • Data compression: According to this theory, the basis of data mining is to compress the given data by encoding in terms of bits, association rules, decision trees, clusters, and so on. Encoding based on the minimum description length principle states that the “best” theory to infer from a set of data is the one that minimises the length of the theory and the length of the data when encoded, using the theory as a predictor for the data. This encoding is typically in bits. • Pattern discovery: In this theory, the basis of data mining is to discover patterns occurring in the database, such as associations, classification models, sequential patterns, and so on. Areas such as machine learning, neural network, association mining, sequential pattern mining, clustering, and several other subfields contribute to this theory. • Probability theory: This is based on statistical theory. In this theory, the basis of data mining is to discover joint probability distributions of random variables, for example, Bayesian belief networks or hierarchical Bayesian models. • Microeconomic view: The microeconomic view considers data mining as the task of finding patterns that are interesting only to the extent that they can be used in the decision-making process of some enterprise (for example, regarding marketing strategies and production plans). This view is one of utility, in which patterns are considered interesting if they can be acted on. Enterprises are regarded as facing optimisation problems, where the object is to maximise the utility or value of a decision. In this theory, data mining becomes a nonlinear optimisation problem. • Inductive databases: According to this theory, a database schema consists of data and patterns that are stored in the database. Data mining is therefore the problem of performing induction on databases, where the task is to query the data and the theory (that is patterns) of the database. This view is popular among many researchers in database systems.

These theories are not mutually exclusive. For example, pattern discovery can also be seen as a form of data reduction or data compression. Ideally, a theoretical framework should be able to model typical data mining tasks (such as association, classification, and clustering), have a probabilistic nature, be able to handle different forms of data, and consider the iterative and interactive essence of data mining. Further efforts are required toward the establishment of a well-defined framework for data mining, which satisfies these requirements.

6.14.2 Statistical Data Mining
The data mining techniques described in this book are primarily database-oriented, that is, designed for the efficient handling of huge amounts of data that are typically multidimensional and possibly of various complex types. There are, however, many well-established statistical techniques for data analysis, particularly for numeric data. These techniques have been applied extensively to some types of scientific data (for example, data from experiments in physics, engineering, manufacturing, psychology, and medicine), as well as to data from economics and the social sciences.
• Regression: In general, these methods are used to predict the value of a response (dependent) variable from one or more predictor (independent) variables, where the variables are numeric. There are various forms of regression, such as linear, multiple, weighted, polynomial, nonparametric, and robust (robust methods are useful when errors fail to satisfy normalcy conditions or when the data contain significant outliers). A minimal worked sketch of simple linear regression follows this list.
• Generalised linear models: These models, and their generalisation (generalised additive models), allow a categorical response variable (or some transformation of it) to be related to a set of predictor variables in a manner similar to the modelling of a numeric response variable using linear regression. Generalised linear models include logistic regression and Poisson regression.
• Analysis of variance: These techniques analyse experimental data for two or more populations described by a numeric response variable and one or more categorical variables (factors). In general, an ANOVA (single-factor analysis of variance) problem involves a comparison of k population or treatment means to determine if at least two of the means are different. More complex ANOVA problems also exist.
• Mixed-effect models: These models are for analysing grouped data, that is, data that can be classified according to one or more grouping variables. They typically describe relationships between a response variable and some covariates in data grouped according to one or more factors. Common areas of application include multilevel data, repeated measures data, block designs, and longitudinal data.
• Factor analysis: This method is used to determine which variables are combined to generate a given factor. For example, for many psychiatric data, it is not possible to measure a certain factor of interest directly (such as intelligence); however, it is often possible to measure other quantities (such as student test scores) that reflect the factor of interest. Here, none of the variables is designated as dependent.
• Discriminant analysis: This technique is used to predict a categorical response variable. Unlike generalised linear models, it assumes that the independent variables follow a multivariate normal distribution. The procedure attempts to determine several discriminant functions (linear combinations of the independent variables) that discriminate among the groups defined by the response variable. Discriminant analysis is commonly used in the social sciences.
• Time series analysis: There are many statistical techniques for analysing time-series data, such as autoregression methods, univariate ARIMA (autoregressive integrated moving average) modelling, and long-memory time-series modelling.
• Survival analysis: Several well-established statistical techniques exist for survival analysis. These techniques were originally designed to predict the probability that a patient undergoing a medical treatment would survive at least to time t. Methods for survival analysis, however, are also commonly applied to manufacturing settings to estimate the life span of industrial equipment. Popular methods include Kaplan-Meier estimates of survival, Cox proportional hazards regression models, and their extensions.
• Quality control: Various statistics can be used to prepare charts for quality control, such as Shewhart charts and CUSUM charts (both of which display group summary statistics). These statistics include the mean, standard deviation, range, count, moving average, moving standard deviation, and moving range.
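As referenced in the regression bullet above, here is a minimal, illustrative Python sketch of ordinary least-squares simple linear regression using the closed-form formulas for the slope and intercept. The small data set is invented purely for the example; real statistical work would normally use an established package.

```python
def simple_linear_regression(xs, ys):
    """Fit y = a + b*x by ordinary least squares and return (intercept a, slope b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                      # slope (assumes the x values are not all equal)
    a = mean_y - b * mean_x            # intercept
    return a, b

# Toy data: predictor x and response y.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
a, b = simple_linear_regression(xs, ys)
print(round(a, 2), round(b, 2))        # intercept about 0.15, slope about 1.95
print(round(a + b * 6, 2))             # predicted response for a new x value: about 11.85
```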


6.14.3 Visual and Audio Data Mining Visual data mining discovers implicit and useful knowledge from large data sets using data and/or knowledge visualisation techniques. The human visual system is controlled by the eyes and brain, the latter of which can be thought of as a powerful, highly parallel processing and reasoning engine containing a large knowledge base. Visual data mining essentially combines the power of these components, making it a highly attractive and an effective tool for the comprehension of data distributions, patterns, clusters, and outliers in data.

Visual data mining can be viewed as an integration of two disciplines: data visualisation and data mining. It is also closely related to computer graphics, multimedia systems, human computer interaction, pattern recognition, and high-performance computing. In general, data visualisation and data mining can be integrated in the following ways: • Data visualisation: Data in a database or data warehouse can be viewed at different levels of granularity or abstraction, or as different combinations of attributes or dimensions. Data can be presented in various visual forms, such as boxplots, 3-D cubes, data distribution charts, curves, surfaces, link graphs, and so on. Visual display can help give users a clear impression and overview of the data characteristics in a database. • Data mining result visualisation: Visualisation of data mining results is the presentation of the results or knowledge obtained from data mining in visual forms. Such forms may include scatter plots and boxplots (obtained from descriptive data mining), as well as decision trees, association rules, clusters, outliers, generalised rules, and so on. • Data mining process visualisation: This type of visualisation presents the various processes of data mining in visual forms so that users can see how the data are extracted and from which database or data warehouse they are extracted, as well as how the selected data are cleaned, integrated, preprocessed, and mined. Moreover, it may also show which method is selected for data mining, where the results are stored, and how they may be viewed. • Interactive visual data mining: In (interactive) visual data mining, visualisation tools can be used in the data mining process to help users make smart data mining decisions. For example, the data distribution in a set of attributes can be displayed using coloured sectors (where the whole space is represented by a circle). This display helps users to determine which sector should first be selected for classification and where a good split point for this sector may be.

Audio data mining uses audio signals to indicate the patterns of data or the features of data mining results. Although visual data mining may disclose interesting patterns using graphical displays, it requires users to concentrate on watching patterns and identifying interesting or novel features within them. This can sometimes be quite tiresome. If patterns can be transformed into sound and music, then instead of watching pictures, we can listen to pitches, rhythms, tune, and melody in order to identify anything interesting or unusual. This may relieve some of the burden of visual concentration and be more relaxing than visual mining. Therefore, audio data mining is an interesting complement to visual mining.

6.14.4 Data Mining and Collaborative Filtering Today’s consumers are faced with millions of goods and services when shopping on-line. Recommender systems help consumers by making product recommendations during live customer transactions. A collaborative filtering approach is commonly used, in which products are recommended based on the opinions of other customers. Collaborative recommender systems may employ data mining or statistical techniques to search for similarities among customer preferences.

A collaborative recommender system works by finding a set of customers, referred to as neighbours, who have a history of agreeing with the target customer (for example, they tend to buy similar sets of products, or give similar ratings for certain products). Collaborative recommender systems face two major challenges: scalability and ensuring quality recommendations to the consumer. Scalability is important, because e-commerce systems must be able to search through millions of potential neighbours in real time. If the site is using browsing patterns as indications of product preference, it may have thousands of data points for some of its customers. Ensuring quality recommendations is essential in order to gain consumers' trust. If consumers follow a system recommendation but then do not end up liking the product, they are less likely to use the recommender system again. As with classification systems, recommender systems can make two types of errors: false negatives and false positives. Here, false negatives are products that the system fails to recommend, although the consumer would like them. False positives are products that are recommended, but which the consumer does not like. False positives are less desirable because they can annoy or anger consumers.

An advantage of recommender systems is that they provide personalisation for customers of e-commerce, promoting one-to-one marketing. Dimension reduction, association mining, clustering, and Bayesian learning are some of the techniques that have been adapted for collaborative recommender systems. While collaborative filtering explores the ratings of items provided by similar users, some recommender systems explore a content-based method that provides recommendations based on the similarity of the contents contained in an item. Moreover, some systems integrate both content-based and user-based methods to achieve further improved recommendations. Collaborative recommender systems are a form of intelligent query answering, which consists of analysing the intent of a query and providing generalised, neighbourhood, or associated information relevant to the query.
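A minimal user-based collaborative filtering sketch along these lines is shown below. The ratings dictionary is invented for illustration; neighbours of the target user are found by cosine similarity over co-rated items, and unseen items are scored by a similarity-weighted average of the neighbours' ratings. This is only one simple variant of the approach described above.

```python
import math

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "alice": {"book": 5, "camera": 3, "headphones": 4},
    "bob":   {"book": 5, "camera": 2, "laptop": 4},
    "carol": {"camera": 4, "laptop": 5, "headphones": 2},
}

def cosine(u, v):
    """Cosine similarity between two users over the items they both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def recommend(target, k=2):
    """Score items the target has not rated, using the k most similar neighbours."""
    sims = sorted(
        ((cosine(ratings[target], ratings[u]), u) for u in ratings if u != target),
        reverse=True,
    )[:k]
    scores = {}
    for sim, u in sims:
        for item, r in ratings[u].items():
            if item not in ratings[target] and sim > 0:
                scores.setdefault(item, [0.0, 0.0])
                scores[item][0] += sim * r          # similarity-weighted rating sum
                scores[item][1] += sim              # sum of similarities
    return {item: s / w for item, (s, w) in scores.items()}

print(recommend("alice"))   # predicts a score of roughly 4.5 for 'laptop'
```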


Summary
• Data mining is the process of extracting interesting patterns or knowledge from huge amounts of data. It is the set of activities used to find new, hidden, unexpected or unusual patterns in data.
• An important feature of object-relational and object-oriented databases is their capability of storing, accessing, and modelling complex structure-valued data, such as set- and list-valued data and data with nested structures.
• Generalisation on a class composition hierarchy can be viewed as generalisation on a set of nested structured data.
• A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or medical imaging data, and VLSI chip layout data. Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases.
• Spatial database systems usually handle vector data that consist of points, lines, polygons (regions), and their compositions, such as networks or partitions.
• Text data analysis and information retrieval (IR) form a field that has been developing in parallel with database systems for many years.
• Once an inverted index is created for a document collection, a retrieval system can answer a keyword query quickly by looking up which documents contain the query keywords.
• The World Wide Web serves as a huge, widely distributed, global information service centre for news, advertisements, consumer information, financial management, education, government, e-commerce, and many other information services.
• Recent research in DNA analysis has led to the discovery of genetic causes for many diseases and disabilities, as well as approaches for disease diagnosis, prevention and treatment.

References
• Han, J., Kamber, M. and Pei, J., 2011. Data Mining: Concepts and Techniques, 3rd ed., Elsevier.
• Alexander, D., Data Mining [Online] Available at: . [Accessed 9 September 2011].
• Galeas, Web mining [Online PDF] Available at: . [Accessed 12 September 2011].
• Springerlink, 2006. Data Mining System Products and Research Prototypes [Online PDF] Available at: . [Accessed 12 September 2011].
• Dr. Kuonen, D., 2009. Data Mining Applications in Pharma/BioPharma Product Development [Video Online] Available at: . [Accessed 12 September 2011].
• SalientMgmtCompany, 2011. Salient Visual Data Mining [Video Online] Available at: <http://www.youtube.com/watch?v=fosnA_vTU0g>. [Accessed 12 September 2011].

Recommended Reading • Liu, B., 2011. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2nd ed., Springer. • Scime, A., 2005. Web mining: applications and techniques, Idea Group Inc. (IGI). • Markov, Z. and Larose, D. T., 2007. Data mining the Web: uncovering patterns in Web content, structure, and usage, Wiley-Interscience.

Self Assessment
1. Using information contained within______, data mining can often provide answers to questions about an organisation that a decision maker has previously not thought to ask. a. metadata b. web mining c. data warehouse d. data extraction

2. Aggregation and approximation are another important means of ______. a. generalisation b. visualisation c. standardisation d. organisation

3. Which of the following stores a large amount of space-related data, such as maps, preprocessed remote sensing or medical imaging data, and VLSI chip layout data? a. Knowledge database b. Spatial database c. Data mining d. Data warehouse

4. ______systems usually handle vector data that consist of points, lines, polygons and their compositions, such as networks or partitions. a. Knowledge database b. Data mining c. Data warehouse d. Spatial database

5. ______system stores and manages a large collection of multimedia data. a. Knowledge database b. Multimedia data mining c. Multimedia database d. Spatial database

6. Which of the following can contain additional dimensions and measures for multimedia information, such as colour, texture, and shape? a. Multimedia data cube b. Multimedia data mining c. Multimedia database d. Spatial database


7. What is used for mining multimedia data, especially in scientific research, such as astronomy, seismology, and geoscientific research? a. Data warehousing b. Classification and predictive modelling c. Data extraction d. Dimensional modelling

8. ______is the percentage of retrieved documents that are in fact relevant to the query. a. Recall b. Precision c. Text mining d. Information retrieval

9. Match the columns
Column I:
A. The WWW contains a huge and dynamic collection of hyperlinked information, providing a huge source for data mining.
B. It is an effective way to discover knowledge from huge amounts of data.
C. One important direction towards improving the overall efficiency of the mining process while increasing user interaction is constraint-based mining.
D. In addition, data mining for business continues to expand as e-commerce and marketing become mainstream elements of the retail industry.
Column II:
1. Application Exploration
2. Scalable data mining methods
3. Visual data mining
4. Web mining
a. 1-A, 2-B, 3-C, 4-D b. 1-D, 2-C, 3-B, 4-A c. 1-B, 2-A, 3-C, 4-D d. 1-C, 2-B, 3-A, 4-D

10. In which of the following theories, the basis of data mining is to reduce the data representation? a. Data reduction b. Data compression c. Pattern discovery d. Probability theory

Chapter VII Implementation and Maintenance

Aim

The aim of this chapter is to:

• introduce the physical design steps

• analyse physical storage of data

• explore indexing the data warehouse

Objectives

The objectives of this chapter are to:

• explicate different performance enhancement techniques

• describe data warehouse deployment

• elucidate how to manage data warehouse

Learning outcome

At the end of this chapter, you will be able to:

• comprehend B-tree, clustered and bitmapped indexes

• enlist different models of data mining

• understand growth and maintenance of database


7.1 Introduction
As you know, in an OLTP system, you have to perform a number of tasks to complete the physical model. The logical model forms the primary basis for the physical model, but, in addition, a number of factors must be considered before you get to the physical model. You must determine where to place the database objects in physical storage. What is the storage medium and what are its features? This information helps you to define the storage parameters. Later, you need to plan for indexing: an important consideration is on which columns in each table the indexes must be built. You also need to look into other methods for improving performance. You have to examine the initialisation parameters in the DBMS and decide how to set them. Similarly, in the data warehouse environment, you need to consider many different factors to complete the physical model.

7.2 Physical Design Steps Following is a pictorial representation of the steps in the physical design process for a data warehouse. Note the steps indicated in the figure. In the following subsections, we will broadly describe the activities within these steps. You will understand how at the end of the process you arrive at the completed physical model.

[Figure: the physical design steps are develop standards, create aggregates plan, determine the data partitioning scheme, establish clustering options, prepare indexing strategy, assign storage structures, and complete the physical model.]

Fig. 7.1 Physical design process

7.2.1 Develop Standards Many companies invest a lot of time and money to prescribe standards for information systems. The standards range from how to name the fields in the database to how to conduct interviews with the user departments for requirements definition. A group in IT is designated to keep the standards up-to-date. In some companies, every revision must be updated and authorised by the CIO. Through the standards group, the CIO makes sure that the standards are followed correctly and strictly. Now the practice is to publish the standards on the company’s intranet. If your IT department is one of the progressive ones giving due attention to standards, then be happy to embrace and adapt the standards for the data warehouse. In the data warehouse environment, the scope of the standards expands to include additional areas. Standards ensure consistency across the various areas. If you have the same way of indicating names of the database objects, then you are leaving less room for ambiguity. Standards take on greater importance in the data warehouse environment. This is because the usage of the object names is not confined to the IT department. The users will also be referring to the objects by names when they formulate and run their own queries.

7.2.2 Create Aggregates Plan
If your data warehouse stores data only at the lowest level of granularity, every summary query has to read through all the detailed records and sum them up. Consider a query looking for total sales for the year, by product, for all the stores. If you have detailed records keeping sales by individual calendar dates, by product, and by store, then this query needs to read a large number of detailed records. So what is the best method to improve performance in such cases? If you have summary tables at higher levels, such as sales of products by store, the query could run faster. But how many such summary tables must you create? What is the limit?

In this step, review the possibilities for building aggregate tables. You get clues from the requirements definition. Look at each dimension table and examine the hierarchical levels. Which of these levels are more important for aggregation? Clearly assess the tradeoff. What you need is a comprehensive plan for aggregation. The plan must spell out the exact types of aggregates you must build for each level of summarisation. It is possible that many of the aggregates will be present in the OLAP system. If OLAP instances are not for universal use by all users, then the necessary aggregates must be present in the main warehouse. The aggregate database tables must be laid out and included in the physical model.
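As a rough, DBMS-independent illustration of the aggregation idea, the sketch below rolls invented detail sales rows up to a (year, product, store) summary, which is exactly the kind of aggregate table that lets a yearly total-sales query avoid scanning every detail record.

```python
from collections import defaultdict

# Detail fact rows: (date, product, store, sales_amount). Invented sample data.
detail = [
    ("2013-01-05", "soap", "store_1", 120.0),
    ("2013-01-06", "soap", "store_1", 80.0),
    ("2013-02-11", "soap", "store_2", 45.0),
    ("2013-03-02", "shampoo", "store_1", 60.0),
]

# Aggregate to the (year, product, store) level.
aggregate = defaultdict(float)
for date, product, store, amount in detail:
    year = date[:4]
    aggregate[(year, product, store)] += amount

for key, total in sorted(aggregate.items()):
    print(key, total)
# ('2013', 'shampoo', 'store_1') 60.0
# ('2013', 'soap', 'store_1') 200.0
# ('2013', 'soap', 'store_2') 45.0
```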

7.2.3 Determine the Data Partitioning Scheme
Consider the data volumes in the warehouse. What about the number of rows in a fact table? Let us make some rough calculations. Assume there are four dimension tables with 50 rows each on average. Even with this limited number of dimension table rows, the potential number of fact table rows exceeds six million (50 × 50 × 50 × 50 = 6,250,000). Fact tables are generally very large. Large tables are not easy to manage. During the load process, the entire table must be closed to the users. Again, backup and recovery of large tables pose difficulties because of their sheer sizes.

Partitioning divides large database tables into manageable parts. Always consider partitioning options for fact tables. It is not just the decision to partition that counts. Based on your environment, the real decision is about how exactly to partition the fact tables. Your data warehouse may be a conglomerate of conformed data marts. You must consider partitioning options for each fact table. You may find that some of your dimension tables are also candidates for partitioning. Product dimension tables are especially large. Examine each of your dimension tables and determine which of these must be partitioned. In this step, come up with a definite partitioning scheme. The scheme must include:

• The fact tables and the dimension tables selected for partitioning
• The type of partitioning for each table (horizontal or vertical)
• The number of partitions for each table
• The criteria for dividing each table (for example, by product groups)
• A description of how to make queries aware of partitions
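To make the idea concrete, the sketch below horizontally partitions a small fact table by month, one common partitioning criterion. The rows and the choice of month as the partitioning key are assumptions made only for illustration.

```python
from collections import defaultdict

# Fact rows: (date, product_group, amount). Invented sample data.
fact_rows = [
    ("2013-01-15", "beverages", 200.0),
    ("2013-01-28", "snacks", 75.0),
    ("2013-02-03", "beverages", 120.0),
]

# Horizontal partitioning: route each row to a partition keyed by year-month.
partitions = defaultdict(list)
for row in fact_rows:
    partitions[row[0][:7]].append(row)

# A partition-aware query touches only the partitions it needs.
january_total = sum(amount for _, _, amount in partitions["2013-01"])
print(sorted(partitions))     # ['2013-01', '2013-02']
print(january_total)          # 275.0
```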

7.2.4 Establish Clustering Options In the data warehouse, many of the data access patterns rely on sequential access of large quantities of data. Whenever you have this type of access and processing, you will realise much performance improvement from clustering. This technique involves placing and managing related units of data in the same physical block of storage. This arrangement causes the related units of data to be retrieved together in a single input operation. You need to establish the proper clustering options before completing the physical model. Examine the tables, table by table, and find pairs that are related. This means that rows from the related tables are usually accessed together for processing in many cases. Then make plans to store the related tables close together in the same file on the medium.

For two related tables, you may want to store the records from both files interleaved. A record from one table is followed by all the related records in the other table while storing in the same file.


7.2.5 Prepare an Indexing Strategy
This is a crucial step in the physical design. Unlike OLTP systems, the data warehouse is query-centric. As you know, indexing is perhaps the most effective mechanism for improving performance. A solid indexing strategy results in enormous benefits. The strategy must lay down the index plan for each table, indicating the columns selected for indexing. The sequence of the attributes in each index also plays a critical role in performance. Scrutinise the attributes in each table to determine which attributes qualify for bit-mapped indexes. Prepare a comprehensive indexing plan. The plan must indicate the indexes for each table. Further, for each table, present the sequence in which the indexes will be created. Describe the indexes that are expected to be built in the very first instance of the database. Many indexes can wait until you have monitored the data warehouse for some time. Spend enough time on the indexing plan.
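As a rough illustration of why bit-mapped indexes suit low-cardinality columns, the sketch below builds one bitmap per distinct value of a column and answers queries with cheap bitwise operations. The table contents are invented; a real DBMS implements this with compressed bitmaps.

```python
# Low-cardinality column values for rows 0..5 of a dimension table (invented data).
region = ["north", "south", "north", "east", "south", "north"]

# Build one bitmap (an integer used as a bit vector) per distinct value:
# bit i is set when row i holds that value.
bitmaps = {}
for row_id, value in enumerate(region):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)

def rows_matching(value):
    """Decode a bitmap back into the list of matching row ids."""
    bits = bitmaps.get(value, 0)
    return [i for i in range(len(region)) if bits & (1 << i)]

print(rows_matching("north"))                 # [0, 2, 5]
# Combining predicates is a cheap bitwise operation on the bitmaps:
north_or_east = bitmaps["north"] | bitmaps["east"]
print(bin(north_or_east))                     # 0b101101
```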

7.2.6 Assign Storage Structures Where do you want to place the data on the physical storage medium? What are the physical files? What is the plan for assigning each table to specific files? How do you want to divide each physical file into blocks of data? Answers to questions like these go into the data storage plan. In an OLTP system, all data resides in the operational database. When you assign the storage structures in an OLTP system, your effort is confined to the operational tables accessed by the user applications. In a data warehouse, you are not just concerned with the physical files for the data warehouse tables. Your storage assignment plan must include other types of storage such as the temporary data extract files, the staging area, and any storage needed for front-end applications. Let the plan include all the types of storage structures in the various storage areas.

7.2.7 Complete Physical Model
This final step reviews and confirms the completion of the prior activities and tasks. By the time you reach this step, you have the standards for naming the database objects. You have determined which aggregate tables are necessary and how you are going to partition the large tables. You have completed the indexing strategy and have planned for other performance options. You also know where to put the physical files. All the information from the prior steps allows you to complete the physical model. The result is the creation of the physical schema. You can code the data definition language (DDL) statements in the chosen RDBMS and create the physical structure in the target database.

7.3 Physical Storage
Consider the processing of a query. After the query is verified for syntax and checked against the data dictionary for authorisation, the DBMS translates the query statements to determine what data is requested. From the data dictionary entries about the tables, rows, and columns desired, the DBMS maps the requests to the physical storage where the data access takes place. The query gets filtered down to physical storage, and this is where the input operations begin. The efficiency of the data retrieval is closely tied to where the data is stored in physical storage and how it is stored there.

What are the various physical data structures in the storage area? What is the storage medium and what are its characteristics? Do the features of the medium support any efficient storage or retrieval techniques? We will explore answers to questions such as these. From the answers you will derive methods for improving performance.

7.3.1 Storage Area Data Structures
Take an overall look at all the data related to the data warehouse. First, you have the data in the staging area. Though you may look for efficiency in storage and loading, the arrangement of the data in the staging area does not contribute to the performance of the data warehouse from the point of view of the users. Looking further, the other sets of data relate to the data content of the warehouse. These are the data and index tables in the data warehouse. How you arrange and store these tables definitely has an impact on performance. Next, you have the multidimensional data in the OLAP system. In most cases, the supporting proprietary software dictates the storage and the retrieval of data in the OLAP system.

The following figure shows the physical data structures in the data warehouse. Observe the different levels of data. Notice the detail and summary data structures. Think further about how the data structures are implemented in physical storage as files, blocks, and records.

[Figure: physical data structures in the warehouse. The data staging area holds data extract flat files and load image flat files. The data warehouse repository holds relational database data files for the warehouse data (detailed data and light summaries of the transformed data), relational database index files, and partitioned physical files. The OLAP system holds physical files in a proprietary matrix format storing multidimensional cubes of data.]

Fig. 7.2 Data structures in the warehouse

7.3.2 Optimising Storage
You have reviewed the physical storage structures. When you break each data structure down to the physical storage level, you find that the structure is stored as files in the physical storage medium. Take the example of the customer dimension and the salesperson dimension tables. You have basically two choices for storing the data of these two dimension tables: store records from each table in its own physical file, or, if the records from these tables are retrieved together most of the time, store records from both tables in a single physical file. In either case, records are stored in a file. A collection of records in a file forms a block; in other words, a file comprises blocks and each block contains records.

Remember, any optimising at the physical level is tied to the features and functions available in the DBMS. You have to relate the techniques discussed here with the workings of your DBMS. Please study the following optimising techniques.

Set the correct block size
A data block in a file is the fundamental unit of input/output transfer from the database to memory, where the data gets manipulated. Each block contains a block header that holds control information. The block header is not available for storing data, so too many block headers mean too much wasted space.


When the block size is larger, more records or rows fit into a single block. Because more records may be fetched in one read, larger block sizes decrease the number of reads. Another advantage relates to space utilisation by the block headers: as a percentage of the space in a block, the block header occupies less space in a larger block, so, overall, all the block headers put together occupy less space. But here is the downside of larger block sizes. Even when only a small number of records are needed, the operating system reads too much extra information into memory, thereby impacting memory management.

However, because most data warehouse queries request large numbers of rows, memory management as indicated rarely poses a problem. There is another aspect of data warehouse tables that could cause some concern. Data warehouse tables are denormalised and therefore, the records tend to be large. Sometimes, a record may be too large to fit in a single block. Then the record has to be split across more than one block. The broken parts have to be connected with pointers or physical addresses. Such pointer chains affect performance to a large extent. Consider all the factors and set the block size at the appropriate size. Generally, increased block size gives better performance but you have to find the proper size.
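If your DBMS lets you choose block sizes per tablespace, the choice can be made explicit in the DDL. The following is an assumed Oracle-style example; the tablespace name and file path are hypothetical, and a matching buffer cache for the non-default block size must already be configured.

CREATE TABLESPACE dw_large_blocks
  DATAFILE '/u01/oradata/dw/dw_large_01.dbf' SIZE 10G
  BLOCKSIZE 16K;     -- larger blocks suit the large sequential scans typical of the warehouse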

Set the proper block usage parameters
Most of the leading DBMSs allow you to set block usage parameters at appropriate values and derive performance improvement. You will find that these usage parameters themselves and the methods for setting them are dependent on the database software. Generally, two parameters govern the usage of the blocks, and proper usage of the blocks improves performance.

Manage migration
When a record in a block is updated and there is not enough space in the same block for storing the expanded record, most DBMSs move the entire updated record to another block and create a pointer to the migrated record. Such migration affects performance because multiple blocks must be read. This problem may be resolved by adjusting the block percent free parameter. However, migration is not a major problem in data warehouses because of the negligible number of updates.

Manage block utilisation
Performance degenerates when data blocks contain excessive amounts of free space. Whenever a query calls for a full table scan, performance suffers because of the need to read too many blocks. Manage block underutilisation by adjusting the block percent free parameter downward and the block percent used parameter upward.

Resolve dynamic extension
When the current extent on disk storage for a file is full, the DBMS finds a new extent and allows an insert of a new record. This task of finding a new extent on the fly is referred to as dynamic extension. However, dynamic extension comes with significant overhead. Reduce dynamic extension by allocating large initial extents.
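The sketch below shows how these block usage and extent parameters might be set for a bulk-loaded, rarely updated warehouse table. It assumes Oracle-style PCTFREE/PCTUSED/STORAGE clauses with hypothetical names and values; some modern space-management modes ignore PCTUSED, so treat this purely as an illustration.

CREATE TABLE daily_sales_fact (
  time_key    NUMBER(10),
  product_key NUMBER(10),
  sale_units  NUMBER(12)
)
PCTFREE 5            -- reserve little free space: updates, and hence migration, are rare
PCTUSED 90           -- keep blocks well filled to avoid underutilisation
STORAGE (
  INITIAL 500M       -- large initial extent to reduce dynamic extension during loads
  NEXT    100M
);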

Employ file striping techniques
You perform file striping when you split the data into multiple physical parts and store these individual parts on separate physical devices. File striping allows concurrent input/output operations and improves file access performance substantially.

7.3.3 Using RAID Technology
Redundant array of inexpensive disks (RAID) technology has become so common that almost all of today's data warehouses make good use of it. These disk arrays are found on large servers and enable the server to continue operation even while recovering from the failure of any single disk. The underlying technique that gives RAID its primary benefit breaks the data into parts and writes the parts to multiple disks in a striping fashion. The technology can recover and reconstruct the data when a disk fails, so RAID is very fault-tolerant. Here are the basic features of the technology:

• Disk mirroring: writing the same data to two disk drives connected to the same controller
• Disk duplexing: similar to mirroring, except here each drive has its own distinct controller
• Parity checking: addition of a parity bit to the data to ensure correct data transmission
• Disk striping: data spread across multiple disks by sectors or bytes

RAID is implemented at six different levels: RAID 0 through RAID 5. Please turn to Figure 7.3, which gives you a brief description of RAID. Note the advantages and disadvantages. The lowest level configuration RAID 0 will provide data striping. At the other end of the range, RAID 5 is a very valuable arrangement.

• RAID 0: data records striped across multiple disks without redundancy. High performance and less expensive, but the entire array is out with a single disk failure.
• RAID 1: disk mirroring, with data written redundantly to pairs of drives. High read performance and availability, but expensive because of data duplication.
• RAID 2: data interleaved across disks by bit or block, with extra drives storing correction code. High performance; corrects 1-bit errors on the fly and even detects 2-bit errors, but costly.
• RAID 3: data interleaved across disks by bit or block, with one drive storing parity data. High performance for large blocks of data; on-the-fly recovery is not guaranteed.
• RAID 4: data records interleaved across disks by sectors, with one drive storing parity data. Can handle multiple I/Os from a sophisticated OS; used with only two drives.
• RAID 5: data records sector-interleaved across groups of drives; the most popular arrangement. A dedicated parity drive is unnecessary and it works with two or more drives, but write performance is poor.

Fig. 7.3 RAID technology

Estimating storage sizes
No discussion of physical storage is complete without a reference to the estimation of storage sizes. Every action in the physical model takes place in physical storage. You need to know how much storage space must be made available initially and as the data warehouse expands. Here are a few tips on estimating storage sizes:

For each database table, determine:
• Initial estimate of the number of rows
• Average length of the row
• Anticipated monthly increase in the number of rows
• Initial size of the table in megabytes (MB)
• Calculated table sizes in 6 months and in 12 months

For all tables, determine:
• The total number of indexes
• Space needed for indexes, initially, in 6 months, and in 12 months


Estimate:
• Temporary work space for sorting and merging
• Temporary files in the staging area
• Permanent files in the staging area
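A small worked example shows how the table-size figures are derived. The numbers are assumptions (500,000 initial rows, an average row length of 120 bytes, and 50,000 new rows per month for one fact table), and the query form is Oracle-style.

SELECT ROUND(500000 * 120 / 1024 / 1024, 1)                 AS initial_mb,         -- about 57 MB
       ROUND((500000 + 6 * 50000) * 120 / 1024 / 1024, 1)   AS after_6_months_mb,  -- about 92 MB
       ROUND((500000 + 12 * 50000) * 120 / 1024 / 1024, 1)  AS after_12_months_mb  -- about 126 MB
FROM dual;   -- DUAL is Oracle's one-row dummy table; other DBMSs allow SELECT without FROM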

7.4 Indexing the Data Warehouse
In a query-centric system like the data warehouse environment, the need to process queries faster dominates. There is no surer way of turning your users away from the data warehouse than unreasonably slow queries. For the user in an analysis session going through a rapid succession of complex queries, you have to match the pace of the query results with the speed of thought. Among the various methods to improve performance, indexing ranks very high.

What types of indexes must you build in your data warehouse? The DBMS vendors offer a variety of choices. The choice is no longer confined to sequential index files. All vendors support B-Tree indexes for efficient data retrieval. Another option is the bitmapped index. As we will see later in this section, this indexing technique is very appropriate for the data warehouse environment. Some vendors are extending the power of indexing to specific requirements. These include indexes on partitioned tables and index-organised tables.

7.4.1 B-Tree Index
Most database management systems have the B-Tree index technique as the default indexing method. When you code statements using the data definition language of the database software to create an index, the system creates a B-Tree index. RDBMSs also create B-Tree indexes automatically on primary key values. The B-Tree index technique supersedes other techniques because of its data retrieval speed, ease of maintenance, and simplicity. Refer to the following figure showing an example of a B-Tree index. Notice the tree structure with the root at the top. The index consists of a B-Tree (a balanced, multi-level tree) structure based on the values of the indexed column. In the example, the indexed column is Name, and the B-Tree is created using all the existing names that are the values of the indexed column. Observe the upper blocks that contain index data pointing to the next lower block. Think of a B-Tree index as containing hierarchical levels of blocks. The lowest-level blocks, or leaf blocks, point to the rows in the data table. Note the data addresses in the leaf blocks.

If a column in a table has many unique values, the selectivity of the column is said to be high. In a territory dimension table, the column for City contains many unique values; this column is therefore highly selective. B-Tree indexes are most suitable for highly selective columns. Because the values at the leaf nodes will be unique, they will lead to distinct data rows and not to a chain of rows. What if a single column is not highly selective?

[Figure: a B-Tree index on the Name column. The root block splits the key range into A–K and L–Z; branch blocks split these ranges further (A–D, E–G, H–K, L–O, P–R, S–Z); the leaf blocks hold the individual names (ALLEN, BUSH, CLYNE, DUNNE, ENGEL, FARIS, GORE, HAIG, IGNAR, JONES, KUMAR, LOEWE, MAHER, NIXON, OTTO, PAINE, QUINN, RAJ, SEGEL, TOTO, VETRI, WILLS) together with addresses pointing to the data rows.]

Fig. 7.4 B-Tree index example

Indexes grow in direct proportion to the growth of the indexed data table. Where indexes are concatenations of multiple columns, they tend to increase sharply in size. As the data warehouse deals with large volumes of data, the size of the index files can be a cause for concern. What can we say about the selectivity of the data in the warehouse? Are most of the columns highly selective? Not really. If you inspect the columns in the dimension tables, you will notice a number of columns that contain low-selectivity data. B-Tree indexes do not work well with data whose selectivity is low. What is the alternative? That leads us to another type of indexing technique.

7.4.2 Bitmapped Index
Bitmapped indexes are ideally suited for low-selectivity data. A bitmap is an ordered series of bits, one for each distinct value of the indexed column. Assume that the column for colour has three distinct colours, namely white, almond, and black. Construct a bitmap using these three distinct values. Each entry in the bitmap contains three bits; say the first bit refers to white, the second to almond, and the third to black. If a product is white in colour, the bitmap entry for that product has the first bit set to 1 and the second and third bits set to 0. If a product is almond in colour, the entry has the second bit set to 1 and the first and third bits set to 0.
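A hedged sketch of the two index types discussed so far, using standard and Oracle-style syntax respectively; the table and column names are assumptions.

-- B-Tree index on a highly selective column such as customer name:
CREATE INDEX idx_customer_name ON customer (customer_name);

-- Bitmap index on a low-selectivity column with only a few distinct values:
CREATE BITMAP INDEX idx_product_colour ON product (colour);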

7.4.3 Clustered Indexes
Some RDBMSs offer a further indexing technique. In the B-Tree, bitmapped, or any sequential indexing method, you have a data segment where the values of all columns are stored and an index segment where index entries are kept. The index segment repeats the column values for the indexed columns and also holds the addresses for the entries in the data segment. Clustered tables combine the data segment and the index segment; the two segments are one. The data is the index and the index is the data. Clustered tables improve performance considerably because in one read you get both the index and the data. Using the traditional indexing techniques, you need one read to get the index segment and a second read to get the data segment. Queries run faster with clustered tables when you are looking for exact matches or searching for a range of values. If your RDBMS supports this type of indexing, make use of it wherever you can in your environment.
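In Oracle, for example, this idea appears as an index-organised table; the sketch below is an assumed illustration with hypothetical names, and other RDBMSs expose the same idea under the name clustered index.

CREATE TABLE product_lookup (
  product_key  NUMBER(10) PRIMARY KEY,
  product_name VARCHAR2(100),
  brand        VARCHAR2(50)
) ORGANIZATION INDEX;   -- rows are stored in the primary key index structure itself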


7.4.4 Indexing the Fact Table
The primary key of the fact table consists of the primary keys of all the connected dimensions. If you have the four dimension tables store, product, time, and promotion, then the full primary key of the fact table is the concatenation of the primary keys of the store, product, time, and promotion tables. What are the other columns? The other columns are metrics such as sale units, sale dollars, cost dollars, and so on. These are the types of columns to be considered for indexing the fact tables.

Please study the following tips and use them when planning to create indexes for the fact tables (a minimal sketch follows this list):
• If the DBMS does not create an index on the primary key, deliberately create a B-Tree index on the full primary key.
• Carefully design the order of individual key elements in the full concatenated key for indexing. In the high order of the concatenated key, place the keys of the dimension tables frequently referred to while querying.
• Review the individual components of the concatenated key. Create indexes on combinations based on query processing requirements.
• If the DBMS supports intelligent combinations of indexes for access, then you may create indexes on each individual component of the concatenated key.
• Do not overlook the possibilities of indexing the columns containing the metrics. For example, if many queries look for dollar sales within given ranges, then the column "dollar sales" is a candidate for indexing.
• Bitmapped indexing does not apply to fact tables, as there are hardly any low-selectivity columns.
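The sketch below illustrates the first and the metric-related tips against the sales_fact table used earlier; the index names and the key order are assumptions to be adjusted to your own query patterns.

-- Concatenated index on the full primary key, most frequently constrained keys first:
CREATE INDEX idx_sales_fact_full
  ON sales_fact (time_key, product_key, store_key, promotion_key);

-- A metric column used in many range predicates can also be indexed:
CREATE INDEX idx_sales_fact_dollars ON sales_fact (sale_dollars);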

7.4.5 Indexing the Dimension Tables
Columns in the dimension tables are used in the predicates of queries. A query may run like this: how much are the sales of Product A in the month of March for the Northern Division? Here, the columns product, month, and division from three different dimension tables are candidates for indexing. Inspect the columns of each dimension table carefully and plan the indexes for these tables. You may not be able to achieve performance improvement by indexing the columns in the fact tables, but the columns in the dimension tables offer tremendous possibilities to improve performance through indexing. Here are a few tips on indexing the dimension tables:
• Create a unique B-Tree index on the single-column primary key.
• Examine the columns that are commonly used to constrain the queries. These are candidates for bitmapped indexes.
• Look for columns that are frequently accessed together in large dimension tables. Determine how these columns may be arranged and used to create multicolumn indexes. Remember that the columns that are more frequently accessed, or the columns that are at the higher hierarchical levels in the dimension table, are placed at the high order of the multicolumn indexes.
• Individually index every column likely to be used frequently in join conditions.

7.5 Performance Enhancement Techniques
Apart from the indexing techniques, a few other methods also improve performance in a data warehouse. For example, physically compacting the data while writing to storage enables more data to be loaded into a single block. That also means that more data may be retrieved in one read. Another method for improving performance is the merging of tables. Again, this method enables more data to be retrieved in one read. If you purge unwanted and unnecessary data from the warehouse in a regular manner, you can improve the overall performance. In the remainder of this section, let us review a few other effective performance enhancement techniques. Many techniques are available through the DBMS, and most of these techniques are especially suitable for the data warehouse environment.

7.5.1 Data Partitioning
Typically, the data warehouse holds some very large database tables. The fact tables run into millions of rows. Dimension tables like the product and customer tables may also contain a huge number of rows. When you have tables of such vast sizes, you face certain specific problems. First, loading of large tables takes excessive time. Then, building indexes for large tables also runs into several hours. What about processing of queries against large tables?

Queries also run longer when attempting to sort through large volumes of data to obtain the result sets. Backing up and recovery of huge tables takes an inordinately long time. Again, when you want to selectively purge and archive records from a large table, wading through all the rows takes a long time.

Performing maintenance operations on smaller pieces is easier and faster. Partitioning is a crucial decision and must be planned up front. Doing this after the data warehouse is deployed and goes into production is time-consuming and difficult. Partitioning means deliberate splitting of a table and its index data into manageable parts. The DBMS supports and provides the mechanism for partitioning. When you define the table, you can define the partitions as well. Each partition of a table is treated as a separate object. As the volume increases in one partition, you can split that partition further. The partitions are spread across multiple disks to gain optimum performance. Each partition in a table may have distinct physical attributes, but all partitions of the table have the same logical attributes.
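A hedged, Oracle-style sketch of range partitioning a large fact table by date follows; the table, partition, and tablespace names and the quarterly boundaries are all assumptions.

CREATE TABLE sales_history (
  time_key     NUMBER(10),
  product_key  NUMBER(10),
  sale_dollars NUMBER(14,2),
  sale_date    DATE
)
PARTITION BY RANGE (sale_date) (
  PARTITION sales_2013_q1 VALUES LESS THAN (DATE '2013-04-01') TABLESPACE dw_p1,
  PARTITION sales_2013_q2 VALUES LESS THAN (DATE '2013-07-01') TABLESPACE dw_p2,
  PARTITION sales_2013_q3 VALUES LESS THAN (DATE '2013-10-01') TABLESPACE dw_p3
);

Each partition can then be loaded, indexed, backed up, or taken off-line on its own, which is what delivers the benefits listed below.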

As you observe, partitioning is an effective technique for storage management and improving performance. The benefits are as follows:
• A query needs to access only the necessary partitions. Applications can be given the choice to have partition transparency or they may explicitly request an individual partition. Queries run faster when accessing smaller amounts of data.
• An entire partition may be taken off-line for maintenance. You can separately schedule maintenance of partitions. Partitions promote concurrent maintenance operations.
• Index building is faster.
• Loading data into the data warehouse is easy and manageable.
• Data corruption affects only a single partition. Backup and recovery on a single partition reduces downtime.
• The input–output load gets balanced by mapping different partitions to the various disk drives.

7.5.2 Data Clustering
In the data warehouse, many queries require sequential access of huge volumes of data. The technique of data clustering facilitates such sequential access. Clustering fosters sequential prefetch of related data. You achieve data clustering by physically placing related tables close to each other in storage. When you declare a cluster of tables to the DBMS, the tables are placed in neighbouring areas on disk. How you exercise data clustering depends on the features of the DBMS. Review the features and take advantage of data clustering.

7.5.3 Parallel Processing
Consider a query that accesses large quantities of data, performs summations, and then makes a selection based on multiple constraints. It is immediately obvious that you will achieve major performance improvement if you can split the processing into components and execute the components in parallel. The simultaneous concurrent executions will produce the result faster. Several DBMS vendors offer parallel processing features that are transparent to the users. As the designer of the query, the user need not know how a specific query must be broken down for parallel processing; the DBMS will do that for the user. Parallel processing techniques may be applied to data loading and data reorganisation. Parallel processing techniques work in conjunction with data partitioning schemes. The parallel architecture of the server hardware also affects the way parallel processing options may be invoked. Some physical options are critical for effective parallel processing. You have to assess propositions like placing two partitions on the same storage device if you need to process them in parallel. Parallel processing and partitioning together provide great potential for improved performance. However, the designer must decide how to use them effectively.
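Where the DBMS supports it, parallelism can also be requested explicitly. The examples below assume Oracle-style syntax and hypothetical object names; the degree of parallelism shown is arbitrary.

-- Give the table a default degree of parallelism for full scans:
ALTER TABLE sales_fact PARALLEL 4;

-- Or request parallel execution for one query with an optimiser hint:
SELECT /*+ PARALLEL(sales_fact, 4) */
       product_key, SUM(sale_dollars)
FROM   sales_fact
GROUP  BY product_key;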

7.5.4 Summary Levels
Select the levels of granularity for the purpose of optimising the input–output operations. If the users frequently request weekly sales information, then consider keeping another summary at the weekly level. On the other hand, if you only keep weekly and monthly summaries and no daily details, every query for daily details cannot be satisfied from the data warehouse. Choose your summary and detail levels carefully based on user requirements.


In addition, rolling summary structures are especially useful in a data warehouse. If in your data warehouse you need to keep hourly data, daily data, weekly data, and monthly summaries, create mechanisms to roll the data up into the next higher level automatically with the passage of time: hourly data automatically gets summarised into the daily data, daily data into the weekly data, and so on.
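One way to implement such a rollup is a periodic summarisation job like the hedged sketch below; the table and column names are assumptions, and TRUNC(date, 'IW') is Oracle's way of finding the first day of the ISO week.

INSERT INTO weekly_sales (week_start_date, product_key, sale_units, sale_dollars)
SELECT TRUNC(sale_date, 'IW'),
       product_key,
       SUM(sale_units),
       SUM(sale_dollars)
FROM   daily_sales
WHERE  sale_date < TRUNC(SYSDATE, 'IW')    -- roll up only completed weeks
GROUP  BY TRUNC(sale_date, 'IW'), product_key;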

7.5.5 Referential Integrity Checks
Referential integrity constraints ensure the validity of the relationship between two related tables. The referential integrity rules in the relational model govern the values of the foreign key in the child table and the primary key in the parent table. Every time a row is added or deleted, the DBMS verifies that referential integrity is preserved. This verification ensures that parent rows are not deleted while child rows exist and that child rows are not added without parent rows. Referential integrity verification is critical in OLTP systems, but it reduces performance. Now consider the loading of data into the data warehouse. By the time the load images are created in the staging area, the data structures have already gone through the phases of extraction, cleansing, and transformation. The data ready to be loaded has already been verified for correctness as far as parent and child rows are concerned. Therefore, there is no need for further referential integrity verification while loading the data. Turning off referential integrity verification produces significant performance gains.
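In practice this is done by switching the constraints off around the load, as in the assumed Oracle-style sketch below; the table and constraint names are hypothetical, and ENABLE NOVALIDATE re-enables the rule without re-checking the already verified rows.

ALTER TABLE sales_fact DISABLE CONSTRAINT fk_sales_product;

-- ... run the bulk load of the verified load images here ...

ALTER TABLE sales_fact ENABLE NOVALIDATE CONSTRAINT fk_sales_product;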

7.5.6 Initialisation Parameters
DBMS installation signals the start of performance improvement. At the start of the installation of the database system, you can carefully plan how to set the initialisation parameters. Most of the time you will realise that performance degradation is, to a substantial extent, the result of inappropriate parameters. The data warehouse administrator has a special responsibility to choose the right parameters.

7.5.7 Data Arrays
What are data arrays? Suppose in a financial data mart you need to keep monthly balances of individual line accounts. In a normalised structure, the monthly balances for a year will be found in twelve separate table rows. Assume that in many queries the users request the balances for all the months together. How can you improve the performance? You can create a data array or repeating group with twelve slots, each to contain the balance for one month.

Although creating arrays is a clear violation of normalisation principles, this technique yields tremendous performance improvement. In the data warehouse, the time element is interwoven into all data. Frequently, users look for data in a time series. Another example is the request for monthly sales figures for 24 months for each salesperson. If you analyse the common queries, you will be surprised to see how many need data that can be readily stored in arrays.
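A minimal sketch of such an array structure, with hypothetical names: twelve monthly balance columns held in one row per account per year instead of twelve normalised rows.

CREATE TABLE account_monthly_balance (
  account_key  NUMBER(10),
  fiscal_year  NUMBER(4),
  balance_m01  NUMBER(14,2),
  balance_m02  NUMBER(14,2),
  balance_m03  NUMBER(14,2),
  -- ... balance_m04 to balance_m11 follow the same pattern ...
  balance_m12  NUMBER(14,2),
  CONSTRAINT pk_account_balance PRIMARY KEY (account_key, fiscal_year)
);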

7.6 Data Warehouse Deployment
Once the data warehouse has been designed, built and tested, it needs to be deployed so it is available to the user community. This process is also known as 'roll-out'. A deployment can vary in size from a single-server local deployment (deployed across one country or one location) to a global distributed network involving several time zones and translating data into many different languages.

It is never enough to simply deploy a solution and then 'leave'. Ongoing maintenance and future enhancements must be managed, and a programme of user training is often required, apart from the logistics of the deployment itself. Timing of a deployment is critical: allow too much time and you risk missing your deadlines, while allowing too little time runs you into resourcing problems. As with most IT work, never underestimate the amount of work involved or the amount of time required.

The data warehouse might need to be customised for deployment to a particular country or location, where they might use the general design but have their own data needs. It is not uncommon for different parts of the same organisation to use different computer systems, particularly where mergers and acquisitions are involved, so the data warehouse must be modified to allow this as part of the deployment.

Roll-out to production
This takes place after user acceptance testing (UAT) and includes: moving the data warehouse to the live servers, loading all the live data – not just some of it for testing purposes, optimising the databases and implementing security. All of this must involve minimum disruption to the system users. Needless to say, you need to be very confident everything is in place and working before going live – or you might find you have to do it all over again.

Scheduling jobs
In a production environment, jobs such as data warehouse loading must be automated in scripts and scheduled to run automatically. A suitable time slot must be found that does not conflict with other tasks happening on the same servers. Procedures must be in place to deal with unexpected events and failures.

Regression testing
This type of testing is part of the deployment process and searches for errors that were fixed at one point but have somehow been reintroduced by the change in environment.

7.6.1 Data Warehouse Deployment Lifecycle
Here is the typical lifecycle for a data warehouse deployment project:
• Project Scoping and Planning
  - Project triangle: scope, time and resource.
  - Scope: determine the scope of the project – what would you like to accomplish? This can be defined by the questions to be answered, the number of logical star schemas and the number of OLTP sources.
  - Time: what is the target date for the system to be available to the users?
  - Resource: what is our budget? What are the role and profile requirements of the resources needed to make this happen?
• Requirements
  - What are the business questions? How can the answers to these questions change business decisions or trigger actions?
  - What is the role of the users? How often do they use the system? Do they do any interactive reporting or just view the defined reports in guided navigation?
  - How do you measure? What are the metrics?
• Front-End Design
  - The front-end design needs to cover both interactive analysis and the designed analytics workflow.
  - How does the user interact with the system?
  - What is their analysis process?
• Warehouse Schema Design
  - Dimensional modelling: define the dimensions and facts and define the grain of each star schema.
  - Define the physical schema, depending on the technology decision. If you use relational technology, design the database tables.
• OLTP to Data Warehouse Mapping
  - Logical mapping: table-to-table and column-to-column mapping. Also define the transformation rules.
  - You may need to perform OLTP data profiling. How often does the data change? What is the data distribution?
  - ETL design: include data staging and the detailed ETL process flow.
• Implementation
  - Create the warehouse and ETL staging schema.
  - Develop the ETL programs.
  - Create the logical-to-physical mapping in the repository.
  - Build the end-user reports.


• Deployment
  - Install the Analytics reporting and the ETL tools.
  - Specific setup and configuration for OLTP, ETL, and the data warehouse.
  - Sizing of the system and database.
  - Performance tuning and optimisation.
• Management and Maintenance of the System
  - Ongoing support of the end-users, including security, training, and enhancing the system.
  - You need to monitor the growth of the data.

7.7 Growth and Maintenance
More data marts and more deployment versions have to follow. The team needs to ensure that it is well poised for growth. You need to ensure that the monitoring functions are all in place to constantly keep the team informed of the status. The training and support functions must be consolidated and streamlined. The team must confirm that all the administrative functions are ready and working. Database tuning must continue at a regular pace.

Immediately following the initial deployment, the project team must conduct review sessions. Here are the major review tasks:
• Review the testing process and suggest recommendations.
• Review the goals and accomplishments of the pilots.
• Survey the methods used in the initial training sessions.
• Document highlights of the development process.
• Verify the results of the initial deployment, matching these with user expectations.

The review sessions and their outcomes form the basis for improvement in the further releases of the data warehouse. As you expand and produce further releases, let the business needs, modelling considerations, and infrastructure factors remain as the guiding factors for growth. Follow each release close to the previous release. You can make use of the data modelling done in the earlier release. Build each release as a logical next step. Avoid disconnected releases. Build on the current infrastructure.

7.7.1 Monitoring the Data Warehouse
When you implement an OLTP system, you do not stop with the deployment. The database administrator continues to inspect system performance. The project team continues to monitor how the new system matches up with the requirements and delivers the results. Monitoring the data warehouse is comparable to what happens in an OLTP system, except for one big difference: monitoring an OLTP system dwindles in comparison with the monitoring activity in a data warehouse environment. As you can easily perceive, the scope of the monitoring activity in the data warehouse extends over many features and functions. Unless data warehouse monitoring takes place in a formalised manner, desired results cannot be achieved. The results of the monitoring give you the data needed to plan for growth and to improve performance.

The following figure presents the data warehouse monitoring activity and its usefulness. As you can observe, the statistics serve as the life-blood of the monitoring activity. That leads into growth planning and fine-tuning of the data warehouse.

[Figure: data warehouse monitoring. Monitoring statistics are collected either by sampling, which samples warehouse activity and gathers statistics at specific intervals, or by event-driven collection, which records statistics whenever specified events take place. Data warehouse administration reviews the statistics for growth planning and performance tuning on behalf of the end-users.]

Fig. 7.5 Data warehousing monitoring

7.7.2 Collection of Statistics
What we call monitoring statistics are indicators whose values provide information about data warehouse functions. These indicators provide information on the utilisation of the hardware and software resources. From the indicators, you determine how the data warehouse performs. The indicators present the growth trends. You understand how well the servers function. You gain insights into the utility of the end-user tools. How do you collect statistics on the working of the data warehouse? Two common methods apply to the collection process: sampling methods and event-driven methods. The sampling method measures specific aspects of the system activity at regular intervals. You can set the duration of the interval. If you set the interval as 10 minutes for monitoring processor utilisation, then utilisation statistics are recorded every 10 minutes. The sampling method has minimal impact on the system overhead. The event-driven methods work differently: the recording of the statistics does not happen at intervals, but only when a specified event takes place.

The tools that come with the database server and the host operating system are generally turned on to collect the monitoring statistics. Over and above these, many third-party vendors supply tools especially useful in a data warehouse environment. Most tools gather the values for the indicators and also interpret the results. The data collector component collects the statistics while the analyser component does the interpretation. Most of the monitoring of the system occurs in real time.

The following is a random list that includes statistics for different uses. You will find most of these applicable to your environment.
• Physical disk storage space utilisation
• Number of times the DBMS is looking for space in blocks or causes fragmentation
• Memory buffer activity
• Buffer cache usage
• Input–output performance
• Memory management


• Profile of the warehouse content, giving the number of distinct entity occurrences (example: number of customers, products, and so on)
• Size of each database table
• Accesses to fact table records
• Usage statistics relating to subject areas
• Numbers of completed queries by time slots during the day
• Time each user stays online with the data warehouse
• Total number of distinct users per day
• Maximum number of users during time slots daily
• Duration of daily incremental loads
• Count of valid users
• Query response times
• Number of reports run each day
• Number of active tables in the database

7.7.3 Using Statistics for Growth Planning
As you deploy more versions of the data warehouse, the number of users increases and the complexity of the queries intensifies; you then need to plan for the obvious growth. But how do you know where the expansion is needed? Why have the queries slowed down? Why have the response times degraded? Why was the warehouse down for expanding the table spaces? The monitoring statistics provide you with clues as to what is happening in the data warehouse and how you can prepare for the growth. Following are the types of action that are prompted by the monitoring statistics:
• Allocate more disk space to existing database tables
• Plan for new disk space for additional tables
• Modify file block management parameters to minimise fragmentation
• Create more summary tables to handle the large number of queries looking for summary information
• Reorganise the staging area files to handle more data volume
• Add more memory buffers and enhance buffer management
• Upgrade database servers
• Offload report generation to another middle tier
• Smooth out peak usage during the 24-hour cycle
• Partition tables to run loads in parallel and to manage backups

7.7.4 Using Statistics for Fine-Tuning
The next best use of statistics relates to performance. You will find that a large number of monitoring statistics prove to be useful for fine-tuning the data warehouse. Following are the data warehouse functions that are normally improved based on the information derived from the statistics:
• Query performance
• Query formulation
• Incremental loads
• Frequency of OLAP loads
• OLAP system
• Data warehouse content browsing
• Report formatting
• Report generation

7.7.5 Publishing Trends for Users
This is a new concept not usually found in OLTP systems. In a data warehouse, the users must find their way into the system and retrieve the information by themselves. They must know about the contents. Users must know about the currency of the data in the warehouse. When was the last incremental load? What are the subject areas? What is the count of distinct entities? The OLTP systems are quite different: these systems readily present the users with routine and standardised information, and users of OLTP systems do not need the inside view. Look at the following figure listing the types of statistics that must be published for the users. If your data warehouse is Web-enabled, use the company's intranet to publish the statistics for the users. Otherwise, provide the ability to inquire into the dataset where the statistics are kept.

[Figure: statistics for the users of a Web-enabled data warehouse. A web page on the company intranet, drawing on metadata and monitoring statistics, publishes warehouse data statistics and information for the end-users: warehouse subjects, warehouse tables, summary data, warehouse navigation, warehouse statistics, predefined queries, predefined reports, last full load, last incremental load, scheduled downtime, contacts for support, and user tool upgrades.]

Fig. 7.6 Statistics for the users

7.8 Managing the Data Warehouse
Data warehouse management is concerned with two principal functions. The first is maintenance management: the data warehouse administrative team must keep all the functions going in the best possible manner. The second is change management: as new versions of the warehouse are deployed, as new releases of the tools become available, and as improvements and automation take place in the ETL functions, the administrative team's focus includes enhancements and revisions.

Postdeployment administration covers the following areas:
• Performance monitoring and fine-tuning
• Data growth management
• Storage management
• Network management
• ETL management
• Management of future data mart releases
• Enhancements to information delivery
• Security administration


• Backup and recovery management
• Web technology administration
• Platform upgrades
• Ongoing training
• User support

7.8.1 Platform Upgrades
Your data warehouse deployment platform includes the infrastructure, the data transport component, end-user information delivery, data storage, metadata, the database components, and the OLAP system components. More often than not, a data warehouse is a comprehensive cross-platform environment. The components follow a path of dependency, starting with computer hardware at the bottom, followed by the operating systems, communications systems, the databases, GUIs, and then the application support software. As time goes on, upgrades to these components are announced by the vendors. After the initial rollout, have a proper plan for applying the new releases of the platform components. As you have probably experienced with OLTP systems, upgrades cause potentially serious interruption to the normal work unless they are properly managed. Good planning minimises the disruption. Vendors try to force you into upgrades on their schedule based on their new releases. If the timing is not convenient for you, resist the initiatives from the vendors. Schedule the upgrades at your convenience and based on when your users can tolerate interruptions.

7.8.2 Managing Data Growth
Managing data growth deserves special attention. In a data warehouse, unless you are vigilant about data growth, it could get out of hand very soon and quite easily. Data warehouses already contain huge volumes of data. When you start with a large volume of data, even a small percentage increase can result in substantial additional data. In the first place, a data warehouse may contain too much historical data. Data beyond 10 years may not produce meaningful results for many companies because of the changed business conditions. End-users tend to opt for keeping detailed data at the lowest grain. At least in the initial stages, the users continue to match results from the data warehouse with those from the operational systems. Analysts produce many types of summaries in the course of their analysis sessions. Quite often, the analysts want to store these intermediary datasets for use in similar analysis in the future. Unplanned summaries and intermediary datasets add to the growth of data volumes. Here are just a few practical suggestions to manage data growth:
• Dispense with some detail levels of data and replace them with summary tables.
• Restrict unnecessary drill-down functions and eliminate the corresponding detail level data.
• Limit the volume of historical data. Archive old data promptly.
• Discourage analysts from holding unplanned summaries.
• Where genuinely needed, create additional summary tables.

7.8.3 Storage Management
As the volume of data increases, so does the utilisation of storage. Because of the huge data volume in a data warehouse, storage costs rank very high as a percentage of the total cost. Experts estimate that storage costs are almost four or five times software costs, yet you find that storage management does not receive sufficient attention from data warehouse developers and managers. Here are a few tips on storage management to be used as guidelines:
• Additional rollouts of the data warehouse versions require more storage capacity. Plan for the increase.
• Make sure that the storage configuration is flexible and scalable. You must be able to add more storage with minimum interruption to the current users.
• Use modular storage systems. If not already in use, consider a switchover.
• If yours is a distributed environment with multiple servers having individual storage pools, consider connecting the servers to a single storage pool that can be intelligently accessed.
• As usage increases, plan to spread data over multiple volumes to minimise access bottlenecks.

• Ensure ability to shift data from bad storage sectors.
• Look for storage systems with diagnostics to prevent outages.

7.8.4 ETL Management
This is a major ongoing administrative function, so attempt to automate most of it. Install an alert system to call attention to exceptional conditions. The following are useful suggestions on ETL (data extraction, transformation, loading) management:
• Run daily extraction jobs on schedule. If source systems are not available under extraneous circumstances, reschedule extraction jobs.
• If you employ data replication techniques, make sure that the result of the replication process checks out.
• Ensure that all reconciliation is complete between source system record counts and record counts in extracted files.
• Make sure all defined paths for data transformation and cleansing are traversed correctly.
• Resolve exceptions thrown out by the transformation and cleansing functions.
• Verify load image creation processes, including creation of the appropriate key values for the dimension and fact table rows.
• Check out the proper handling of slowly changing dimensions.
• Ensure completion of daily incremental loads on time.

7.8.5 Information Delivery Enhancements
As time goes on, you will notice that your users have outgrown the end-user tools they started out with. In the course of time, the users become more proficient with locating and using the data. They get ready for more and more complex queries. New end-user tools appear in the market all the time. Why deny your users the latest and the best if they really can benefit from them? What are the implications of enhancing the end-user tools and adopting a different tool set? Unlike a change to ETL, this change relates to the users directly, so plan the change carefully and proceed with caution.

Please review the following tips:
• Ensure the compatibility of the new tool set with all data warehouse components.
• If the new tool set is installed in addition to the existing one, switch your users over in stages.
• Ensure integration of end-user metadata.
• Schedule training on the new tool set.
• If there are any data-stores attached to the original tool set, plan for the migration of the data to the new tool set.

7.8.6 Ongoing Fine-Tuning
Techniques for fine-tuning OLTP systems are applied to the fine-tuning of the data warehouse. The techniques are very similar, except for one big difference: the data warehouse contains a lot more, in fact many times more, data than a typical OLTP system. The techniques will have to apply to an environment replete with mountains of data.

There may not be any point in repeating the indexing and other techniques that you already know from the OLTP environment. Following are a few practical suggestions:
• Have a regular schedule to review the usage of indexes. Drop the indexes that are no longer used.
• Monitor query performance daily. Investigate long-running queries. Work with the user groups that seem to be executing long-running queries. Create indexes if needed.
• Analyse the execution of all predefined queries on a regular basis. RDBMSs have query analysers for this purpose (a minimal sketch appears after this list).


• Review the load distribution at different times of the day. Determine the reasons for large variations.
• Although you have instituted a regular schedule for ongoing fine-tuning, from time to time you will come across some queries that suddenly cause grief. You will hear complaints from a specific group of users. Be prepared for such ad hoc fine-tuning needs. The data administration team must have staff set apart for dealing with these situations.
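For the query analysis suggested above, an assumed Oracle-style session might look like the sketch below; the query itself is hypothetical.

EXPLAIN PLAN FOR
  SELECT product_key, SUM(sale_dollars)
  FROM   sales_fact
  GROUP  BY product_key;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);   -- display the execution plan the optimiser chose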

7.9 Models of Data Mining
The different models used for data mining are explained in detail below.

Claims fraud models
The number of challenges facing the Property and Casualty insurance industry seems to have grown geometrically during the past decade. In the past, poor underwriting results and high loss ratios were compensated by excellent returns on investments. However, the performance of financial markets today is not sufficient to deliver the level of profitability that is necessary to support the traditional insurance business model. In order to survive in the bleak economic conditions that dictate the terms of today's merciless and competitive market, insurers must change the way they operate to improve their underwriting results and profitability. An important element in the process of defining the strategies that are essential to ensure the success and profitable results of insurers is the ability to forecast the new directions in which claims management should be developed. This endeavour has become a crucial and challenging undertaking for the insurance industry, given the dramatic events of the past years in the insurance industry worldwide. We can check claims as they arrive and score them as to the likelihood that they are fraudulent. This can result in large savings to the insurance companies that use these technologies.

Customer clone models
The process for selectively targeting prospects for your acquisition efforts often utilises a sophisticated analytical technique called "best customer cloning". These models estimate which prospects are most likely to respond based on characteristics of the company's "best customers". To this end, we build the models or demographic profiles that allow you to select only the best prospects or "clones" for your acquisition programs. In a retail environment, we can even identify the best prospects that are close in proximity to your stores or distribution channels. Customer clone models are appropriate when insufficient response data is available, providing an effective prospect ranking mechanism when response models cannot be built.

Response models
The best method for identifying the customers or prospects to target for a specific product offering is through the use of a model developed specifically to predict response. These models are used to identify the customers most likely to exhibit the behaviour being targeted. Predictive response models allow organisations to find the patterns that separate their customer base so the organisation can contact those customers or prospects most likely to take the desired action. These models contribute to more effective marketing by ranking the best candidates for a specific product offering, thus identifying the low-hanging fruit.

Revenue and profit predictive models
Revenue and profit prediction models combine response/non-response likelihood with a revenue estimate, especially if order sizes, monthly billings, or margins differ widely. Not all responses have equal value, and a model that maximises responses does not necessarily maximise revenue or profit. Revenue and profit predictive models indicate those respondents who are most likely to add a higher revenue or profit margin with their response than other responders.

These models use a scoring algorithm specifically calibrated to select revenue-producing customers and help identify the key characteristics that best identify better customers. They can be used to fine-tune standard response models or used in acquisition strategies.

Cross-sell and up-sell models
Cross-sell/up-sell models identify customers who are the best prospects for the purchase of additional products and services and for upgrading their existing products and services. The goal is to increase share of wallet. Revenue can increase immediately, but loyalty is enhanced as well due to increased customer involvement.

Attrition models
Efficient, effective retention programs are critical in today's competitive environment. While it is true that it is less costly to retain an existing customer than to acquire a new one, the fact is that all customers are not created equal. Attrition models enable you to identify customers who are likely to churn or switch to other providers, thus allowing you to take appropriate pre-emptive action. When planning retention programs, it is essential to be able to identify the best customers, how to optimise existing customers and how to build loyalty through "entanglement". Attrition models are best employed when there are specific actions that the client can take to retard cancellation or cause the customer to become substantially more committed. The modelling technique provides an effective method for companies to identify characteristics of churners for acquisition efforts and also to prevent or forestall cancellation of customers.

Marketing effectiveness creative models
Often the message that is passed on to the customer is one of the most important factors in the success of a campaign. Models can be developed to target each customer or prospect with the most effective message. In direct mail campaigns, this approach can be combined with response modelling to score each prospect with the likelihood that they will respond given that they receive the most effective creative message (that is, the one recommended by the model). In email campaigns this approach can be used to specify a customised creative message for each recipient.

Real time web personalisation with eNuggets
Using our eNuggets real-time data mining system, websites can interact with site visitors in an intelligent manner to achieve desired business goals. This type of application is useful for eCommerce and CRM sites. eNuggets is able to transform web sites from static pages to customised landing pages, built on the fly, that match a customer profile so that the promise of true one-to-one marketing can be realised. eNuggets is a revolutionary new business intelligence tool that can be used for web personalisation or other real-time business intelligence purposes. It can be easily integrated with existing systems such as CRM, outbound telemarketing (that is, intelligent scripting), insurance underwriting, stock forecasting, fraud detection, genetic research and many others. eNuggets uses historical data (either from company transaction data or from outside data) to extract information in the form of English rules understandable by humans. The rules collectively form a model of the patterns in the data that would not be evident to human analysis. When new data comes in, such as a stock transaction from ticker data, eNuggets interrogates the model and finds the most appropriate rule to suggest which course of action will provide the best result (that is, buy, sell or hold).


Summary
• The logical model forms the primary basis for the physical model.
• Many companies invest a lot of time and money to prescribe standards for information systems. The standards range from how to name the fields in the database to how to conduct interviews with the user departments for requirements definition.
• Standards take on greater importance in the data warehouse environment.
• If the data warehouse stores data only at the lowest level of granularity, every such query has to read through all the detailed records and sum them up.
• If OLAP instances are not for universal use by all users, then the necessary aggregates must be present in the main warehouse. The aggregate database tables must be laid out and included in the physical model.
• During the load process, the entire table must be closed to the users.
• In the data warehouse, many of the data access patterns rely on sequential access of large quantities of data.
• Preparing an indexing strategy is a crucial step in the physical design. Unlike OLTP systems, the data warehouse is query-centric.
• The efficiency of the data retrieval is closely tied to where the data is stored in physical storage and how it is stored there.
• Most of the leading DBMSs allow you to set block usage parameters at appropriate values and derive performance improvement.
• Redundant array of inexpensive disks (RAID) technology has become common to the extent that almost all of today's data warehouses make good use of this technology.
• In a query-centric system like the data warehouse environment, the need to process queries faster dominates.
• Bitmapped indexes are ideally suited for low-selectivity data.
• Once the data warehouse is designed, built and tested, it needs to be deployed so it is available to the user community.
• Data warehouse management is concerned with two principal functions: maintenance management and change management.

References
• Ponniah, P., 2001. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals. Wiley-Interscience Publication.
• Larose, D. T., 2006. Data Mining Methods and Models. John Wiley and Sons.
• Wan, D., 2007. Typical data warehouse deployment lifecycle [Online]. [Accessed 12 September 2011].
• StatSoft. Data Mining Techniques [Online]. [Accessed 12 September 2011].
• StatSoft, 2010. Data Mining, Model Deployment and Scoring - Session 30 [Video Online] Available at: <http://www.youtube.com/watch?v=LDoQVbWpgKY>. [Accessed 12 September 2011].
• OracleVideo, 2010. Data Warehousing Best Practices Star Schemas [Video Online] Available at: <http://www.youtube.com/watch?v=LfehTEyglrQ>. [Accessed 12 September 2011].

Recommended Reading
• Kantardzic, M., 2001. Data Mining: Concepts, Models, Methods, and Algorithms, 2nd ed., Wiley-IEEE.
• Khan, A., 2003. Data Warehousing 101: Concepts and Implementation, iUniverse.
• Rainardi, V., 2007. Building a data warehouse with examples in SQL Server, Apress.

Self Assessment
1. The logical model forms the ______ basis for the physical model.
a. primary
b. secondary
c. important
d. former

2. If ______instances are not for universal use by all users, then the necessary aggregates must be present in the main warehouse. a. KDD b. DMKD c. OLAP d. SDMKD

3. What divides large database tables into manageable parts?
a. Extraction
b. Division
c. Transformation
d. Partitioning

4. Which of the following is a crucial step in the physical design?
a. Preparing an indexing strategy
b. Preparing web mining strategy
c. Preparing data mining strategy
d. Preparing OLAP strategy

5. Which statement is false?
a. In the data warehouse, many of the data access patterns rely on sequential access of large quantities of data.
b. Unlike data mining systems, the data warehouse is query-centric.
c. The sequence of the attributes in each index plays a critical role in performance.
d. Scrutinise the attributes in each table to determine which attributes qualify for bit-mapped indexes.

6. In most cases, the supporting proprietary software dictates the storage and the retrieval of data in the ______ system.
a. KDD
b. OLTP
c. SDMKD
d. OLAP

7. What is the full form of RAID?
a. Redundant Arrangement of Inexpensive Disks
b. Redundant Array of Information Disks
c. Redundant Array of Inexpensive Disks
d. Redundant Array of Inexpensive Database


8. Match the columns.
1. Disk mirroring     A. similar to mirroring, except here each drive has its own distinct controller
2. Disk duplexing     B. writing the same data to two disk drives connected to the same controller
3. Parity checking    C. addition of a parity bit to the data to ensure correct data transmission
4. Disk striping      D. data spread across multiple disks by sectors or bytes

a. 1-A, 2-B, 3-C, 4-D
b. 1-D, 2-C, 3-B, 4-A
c. 1-B, 2-A, 3-C, 4-D
d. 1-C, 2-B, 3-A, 4-D

9. Every action in the physical model takes place in ______.
a. physical storage
b. data mining
c. data warehousing
d. disk mirroring

10. In a ______system like the data warehouse environment, the need to process queries faster dominates. a. OLAP b. query-centric c. OLTP d. B-Tree index

Case Study I

Logic-ITA student data

We have performed a number of queries on datasets collected by the Logic-ITA to assist teaching and learning. The Logic-ITA is a web-based tutoring tool used at Sydney University since 2001, in a course taught by the second author. Its purpose is to help students practise formal logic proofs and to inform the teacher of the class progress.

Context of use
Over the four years, around 860 students attended the course and used the tool. In the tool, an exercise consists of a set of formulas (called premises) and another formula (called the conclusion). The aim is to prove that the conclusion can validly be derived from the premises. For this, the student has to construct new formulas, step by step, using logic rules and formulas previously established in the proof, until the conclusion is derived. There is no unique solution and any valid path is acceptable. Steps are checked on the fly and, if incorrect, an error message and possibly a tip are displayed. Students used the tool at their own discretion. A consequence is that there is neither a fixed number nor a fixed set of exercises done by all students.

Data stored
The tool's teacher module collates all the student models into a database that the teacher can query and mine. Two often-queried tables of the database are the tables mistake and correct step. The most common variables are shown in Table 1.1.

login – the student's login id
qid – the question id
mistake – the mistake made
rule – the logic rule involved/used
line – the line number in the proof
startdate – date exercise was started
finishdate – date exercise was finished (or 0 if unfinished)

Table 1.1 Common variables in tables mistake and correct step

(Source: Merceron, A. and Yacef, K., Educational Data Mining: a Case Study, )
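As a rough illustration of the kind of query a teacher might run against the mistake table, the pandas sketch below counts the logic rules most often involved in mistakes. The column names follow Table 1.1; the CSV file name is a hypothetical export, not part of the Logic-ITA itself.

import pandas as pd

# Hypothetical export of the 'mistake' table, with columns as in Table 1.1.
mistakes = pd.read_csv("mistake.csv")   # login, qid, mistake, rule, line, startdate, finishdate

# Which logic rules are most often involved in mistakes?
top_rules = (mistakes.groupby("rule")["mistake"]
             .count()
             .sort_values(ascending=False)
             .head(10))
print(top_rules)

# How many distinct students made at least one mistake on each exercise?
per_exercise = mistakes.groupby("qid")["login"].nunique()
print(per_exercise.describe())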

Questions
1. What is the Logic-ITA? What is its purpose?
Answer: The Logic-ITA is a web-based tutoring tool used at Sydney University since 2001, in a course taught by the second author. Its purpose is to help students practise formal logic proofs and to inform the teacher of the class progress.

2. What are 'premises' and 'conclusion'?
Answer: An exercise consists of a set of formulas (called premises) and another formula (called the conclusion). The aim is to prove that the conclusion can validly be derived from the premises.


3. State the most common variables of the database.
Answer:
login – the student's login id
qid – the question id
mistake – the mistake made
rule – the logic rule involved/used
line – the line number in the proof
startdate – date exercise was started
finishdate – date exercise was finished (or 0 if unfinished)

Case Study II

A Case Study of Exploiting Data Mining Techniques for an Industrial Recommender System

In this case study, we aim to provide recommendations to the loyal customers of a chain of fashion retail stores based in Spain. In particular, the retail stores would like to be able to generate targeted product recommendations to loyal customers based on customer demographics, customer transaction history, or item properties. A comprehensive description of the available dataset is provided in the next subsection. The transformation of this dataset into a format that can be exploited by data mining and machine learning techniques is described in the sections below.

Dataset
The dataset used for this case study contained data on customer demographics, transactions performed, and item properties. The entire dataset covers the period 01/01/2007 – 31/12/2007.

There were 1,794,664 purchase transactions by both loyal and non-loyal customers. The average value of a purchased item was €35.69. We removed the transactions performed by non-loyal customers, which reduced the number of purchase transactions to 387,903 by potentially 357,724 customers. We refer to this dataset as Loyal. The average price of a purchased item was €37.81.

We then proceeded to remove all purchased items with a value of less than €0 because these represent refunds. This reduced the number of purchase transactions to 208,481 by potentially 289,027 customers. We refer to this dataset as Loyal-100.

Dataset Processing
We processed the Loyal dataset to remove incomplete data for the demographic, item, and purchase transaction attributes.

Demographic Attributes
Table 2.1 shows the four demographic attributes we used for this case study. The average item price attribute was not contained in the database; it was derived from the data.

Attribute          Original Format    Processed Codification
Date of birth      String             Numeric age
Address            String             Province category
Gender             String             Gender category
Avg. item price    N/A                Derived numeric value

Table 2.1 Demographic attributes

The date of birth attribute was provided in seven different valid formats, alongside several invalid formats. The invalid formats resulted in 17,125 users being removed from the Loyal dataset. The date of birth was further processed to produce the age of the user in years. We considered an age of less than 18 to be invalid because a loyal customer must be 18 years old to join the scheme; we also considered an age of more than 80 to be unusually old based on the life expectancy of a Spanish person. Customers with an age outside the 18–80 range were removed from the dataset. Customers without a gender, or with a Not Applicable gender, were also removed. Finally, users who did not perform at least one transaction between 01/01/2007 and 31/12/2007 were removed from the dataset. An overview of the number of customers removed can be seen in Table 2.2.


Customer Data Issue          Number Removed
Invalid birth date           17125
Too young                    3926
Too old                      243
Invalid province             44215
Invalid gender               3188
No transactions performed    227297
Total users removed          295994
Customers                    61730

Table 2.2 Customer attribute issues in the Loyal dataset
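A minimal pandas sketch of the demographic cleaning step described above is shown below. The column names, date parsing options and reference date are assumptions for illustration, not the retailer's actual schema.

import pandas as pd

customers = pd.read_csv("customers.csv")   # hypothetical extract: date_of_birth, gender, province, ...

# Parse the date of birth; rows in any unparseable format become NaT and are dropped.
customers["dob"] = pd.to_datetime(customers["date_of_birth"], errors="coerce", dayfirst=True)
customers = customers.dropna(subset=["dob"])

# Derive the age in years at the end of the observation period (31/12/2007).
ref = pd.Timestamp("2007-12-31")
customers["age"] = (ref - customers["dob"]).dt.days // 365

# Keep only customers aged 18-80 with a known gender.
customers = customers[(customers["age"] >= 18) & (customers["age"] <= 80)]
customers = customers[customers["gender"].isin(["M", "F"])]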

Item Attributes
Table 2.3 presents the four item attributes we used for this case study.

Attribute         Original Format    Processed Codification
Designer          String             Designer category
Composition       String             Composition category
Price             Decimal            Numeric value
Release season    String             Release season category

Table 2.3 Item attributes

The item designer, composition, and release season identifiers were translated to nominal categories. The price was kept in the original format and binned using the Weka toolkit. Items lacking complete data on any of the attributes were not included in the final dataset.

Attribute Issue        No. of items removed
Invalid season         9344
No designer            10497
No composition         2788
Total items removed    22629
Items                  6414

Table 2.4 Item attribute issues in the Loyal dataset

Purchase Transaction Attributes
Table 2.5 presents the two transaction attributes we used for this case study.

Attribute                Original Format    Processed Codification
Date                     String             Calendar season category
Individual item price    Decimal            Numeric value

Table 2.5 Transaction attributes

The transaction date field was provided in one valid format and presented no parsing problems. The date of a transaction was codified into a binary representation of the calendar season(s) according to the scheme shown in Table 2.6. This codification scheme results in the “distance” between January and April being equivalent to the “distance” between September and December, which is intuitive.

Month        Spring    Summer    Autumn    Winter
January      0         0         0         1
February     1         0         0         1
March        1         0         0         0
April        1         0         0         0
May          1         1         0         0
June         0         1         0         0
July         0         1         0         0
August       0         1         1         0
September    0         0         1         0
October      0         0         1         0
November     0         0         1         1
December     0         0         0         1

Table 2.6 Codifying transaction date to calendar season
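For illustration, the mapping in Table 2.6 can be expressed as a simple lookup. The function name and the assumption that months are numbered 1–12 are choices of this sketch rather than anything specified in the case study.

# Binary season indicators (Spring, Summer, Autumn, Winter) per calendar month,
# transcribed from Table 2.6.
SEASON_CODE = {
    1:  (0, 0, 0, 1),  2: (1, 0, 0, 1),  3: (1, 0, 0, 0),  4: (1, 0, 0, 0),
    5:  (1, 1, 0, 0),  6: (0, 1, 0, 0),  7: (0, 1, 0, 0),  8: (0, 1, 1, 0),
    9:  (0, 0, 1, 0), 10: (0, 0, 1, 0), 11: (0, 0, 1, 1), 12: (0, 0, 0, 1),
}

def encode_season(month: int) -> tuple:
    """Return the (spring, summer, autumn, winter) indicators for a month number."""
    return SEASON_CODE[month]

print(encode_season(2))   # February -> (1, 0, 0, 1): bridges winter and spring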

The price of each item was kept in the original decimal format and binned using the Weka toolkit. We chose not to remove discounted items from the dataset. Items with no corresponding user were encountered when the user had been removed from the dataset due to an aspect of the user demographic attribute causing a problem. An overview of the number of item transactions removed from the loyal dataset based on the processing and codification step can be seen in Table 2.7.

Issue                           No. of transactions removed
Refund                          2300
Too expensive                   6591
No item record                  74089
No user record                  96442
Total item purchases removed    179422
Remaining item purchases        208481

Table 2.7 Transaction attribute issues in the Loyal dataset


As a result of performing these data processing and cleaning steps, we are left with a dataset we refer to as Loyal-Clean. An overview of the All, Loyal, and processed and codified Loyal-Clean datasets is shown in Table 2.8.

                           All        Loyal     Loyal-Clean
Item transactions          1794664    387903    208481
Customers                  N/A        357724    61730
Total items                29043      29043     6414
Avg. items per customer    N/A        1.08      3.38
Avg. item value            €35.69     €37.81    €36.35

Table 2.8 Database observations

(Source: Cantadore, I., Elliott, D. and Jose, J. M., A Case Study of Exploiting Data Mining Techniques for an Industrial Recommender System [PDF] Available at: . [Accessed 30 September 2011].)

Questions
1. Explain the dataset used in this case study.
2. How were demographic attributes used in the above case study?
3. Write a note on the purchase transaction attributes used in this case study.

Case Study III

ABSTRACT

Data Mining is gaining popularity as an effective tool for increasing profits in a variety of industries. However, the quality of the information resulting from the data mining exercise is only as good as the underlying data. The importance of accurate, accessible data is paramount. A well designed data warehouse can greatly enhance the effectiveness of the data mining process. This paper will discuss the planning and development of a data warehouse for a credit card bank. While the discussion covers a number of aspects and uses of the data warehouse, a particular focus will be on the critical needs for data access pertaining to targeting model development.

The case study will involve developing a Lifetime Value model from a variety of data sources including account history, customer transactions, offer history and demographics. The paper will discuss the importance of some aspects of the physical design and maintenance to the data mining process.

INTRODUCTION
One of the most critical steps in any data mining project is obtaining good data. Good data can mean many things: clean, accurate, predictive, timely, accessible and/or actionable. This is especially true in the development of targeting models. Targeting models are only as good as the data on which they are developed. Since the models are used to select names for promotions, they can have a significant financial impact on a company's bottom line.

The overall objectives of the data warehouse are to assist the bank in developing a totally data-driven approach to marketing, risk and customer relationship management. This would provide opportunities for targeted marketing programs. The analysis capabilities would include:
• Response Modelling and Analysis
• Risk or Approval Modelling and Analysis
• Activation or Usage Modelling and Analysis
• Lifetime Value or Net Present Value Modelling
• Segmentation and Profiling
• Fraud Detection and Analysis
• List and Data Source Analysis
• Sales Management
• Customer Touchpoint Analysis
• Total Customer Profitability Analysis

The case study objectives focus on the development of a targeting model using information and tools available through the data warehouse. Anyone who has worked with target model development knows that data extraction and preparation are often the most time-consuming part of model development. Ask a group of analysts how much of their time is spent preparing data; a majority of them will say over 50%.


[Figure: "Where's the Effort" – bar chart (y-axis 0–60) comparing effort spent on Business Objectives Development, Data Preparation, Data Mining, and Analysis of Results and Knowledge Accumulation]

Fig. 3.1 Analysis of time spent preparing data

Over the last 10 years, the bank had amassed huge amounts of information about its customers and prospects. The analysts and modellers knew there was a great amount of untapped value in the data. They just had to figure out a way to gain access to it. The goal was to design a warehouse that could bring together data from disparate sources into one central repository.

THE TABLES
The first challenge was to determine which tables should go into the data warehouse. We had a number of issues:
• Capturing response information
• Storing transactions
• Defining date fields

Response information
Responses begin to arrive about a week after an offer is mailed. Upon arrival, the response is put through a risk screening process. During this time, the prospect is considered 'Pending'. Once the risk screening process is complete, the prospect is either 'Approved' or 'Declined'. The bank considered two different options for storing the information in the data warehouse.
• The first option was to store the data in one large table. The table would contain information about those approved as well as those declined. Traditionally, across all applications, they saw approval rates hover around 50%. Therefore, whenever analysis was done on either the approved applications (with a risk management focus) or on the declined population (with a marketing as well as risk management focus), every query needed to go through nearly double the necessary number of records.
• The second option was to store the data in three small tables. This accommodated the daily updates and allowed pending accounts to stay separate as they awaited information from either the applicant or another data source.

With applications coming from e-commerce sources, the importance of the “pending” table increased. This table was examined daily to determine which pending accounts could be approved quickly with the least amount of risk. In today’s competitive market, quick decisions are becoming a competitive edge. Partitioning the large customer profile table into three separate tables improved the speed of access for each of the three groups of marketing analysts who had responsibility for customer management, reactivation and retention, and activation. The latter group was responsible for both the one-time buyers and the prospect pools.

FILE STRUCTURE ISSUES
Many of the tables presented design challenges. Structural features that provided ease of use for analysts could complicate the data loading process for the IT staff. This was a particular problem when it came to transaction data. This data is received on a monthly basis and consists of a string of transactions for each account for the month. This includes transactions such as balances, purchases, returns and fees. In order to make use of the information at a customer level, it needs to be summarised. The question was how to best organize the monthly performance data in the data warehouse. Two choices were considered:

Long skinny file: this took the data into the warehouse in much the same form as it arrived.

Each month would enter the table as a separate record. Each year has a separate table. The fields represent the following:

Month01 Cust_1 VarA VarB VarC VarD VarE CDate C#
Month02 Cust_1 VarA VarB VarC VarD VarE CDate C#
Month03 Cust_1 VarA VarB VarC VarD VarE CDate C#
Month04 Cust_1 VarA VarB VarC VarD VarE CDate C#
...
Month12 Cust_1 VarA VarB VarC VarD VarE CDate C#
Month01 Cust_2 VarA VarB VarC VarD VarE CDate C#
Month02 Cust_2 VarA VarB VarC VarD VarE CDate C#
Month03 Cust_2 VarA VarB VarC VarD VarE CDate C#
Month04 Cust_2 VarA VarB VarC VarD VarE CDate C#
...
Month12 Cust_2 VarA VarB VarC VarD VarE CDate C#

Wide file: this design has a single row per customer. It is much more tedious to update. But in its final form, it is much easier to analyze because the data has already been organized into a single customer record. Each year has a separate table. The layout is as follows:

Cust_1 VarA01 VarA02 VarA03 … VarA12 VarB01 VarB02 VarB03 … VarB12 VarC01 VarC02 VarC03 … VarC12 VarD01 VarD02 VarD03 … VarD12 VarE01 VarE02 VarE03 … VarE12 CDate C#
Cust_2 VarA01 VarA02 VarA03 … VarA12 VarB01 VarB02 VarB03 … VarB12 VarC01 VarC02 VarC03 … VarC12 VarD01 VarD02 VarD03 … VarD12 VarE01 VarE02 VarE03 … VarE12 CDate C#

The final decision was to go with the wide file or the single row per customer design. The argument was that the manipulation to the customer level file could be automated thus making the best use of the analyst’s time.
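The choice between the two layouts is essentially a reshaping step. The pandas sketch below turns the long, one-row-per-customer-per-month file into the wide, one-row-per-customer file; the file name and column names (cust_id, month, VarA–VarE) mirror the schematic layouts above rather than the bank's actual fields.

import pandas as pd

# Long ("skinny") layout: one row per customer per month.
long_df = pd.read_csv("monthly_transactions.csv")   # cust_id, month (1-12), VarA ... VarE

# Wide layout: one row per customer, with VarA01..VarA12, VarB01..VarB12, etc.
wide_df = long_df.pivot(index="cust_id", columns="month",
                        values=["VarA", "VarB", "VarC", "VarD", "VarE"])

# Flatten the (variable, month) column MultiIndex into names like VarA01, VarA02, ...
wide_df.columns = [f"{var}{month:02d}" for var, month in wide_df.columns]
wide_df = wide_df.reset_index()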

DATE ISSUES
Many analyses are performed using date values. In the previous situation, we saw how transactions are received and updated on a monthly basis. This is useful when comparing values of the same vintage. However, another analyst might need to compare balances at a certain stage in the customer lifecycle. For example, to track customer balance cycles from multiple campaigns, a field that denotes the load date is needed.

The first type of analysis was tracking monthly activity by the vintage acquisition campaign. For example, calculating monthly trends of balances aggregated separately for those accounts booked in May 99 and September 99. This required aggregating the data for each campaign by the “load date” which corresponded to the month in which the transaction occurred.

167/JNU OLE Data Mining

The second analysis focused on determining and evaluating trends in the customer life cycle. Typically, customers who took a balance transfer at the time of acquisition showed balance run-off shortly after the introductory teaser APR expired and the account was repriced to a higher rate. These are the dreaded "rate surfers." Conversely, a significant number of customers who did not take a balance transfer at the time of acquisition demonstrated balance build. Over time, these customers continued to have higher than average monthly balances. Some demonstrated revolving behaviour: paying less than the full balance each month and a willingness to pay interest on the revolving balance. The remainder in this group simply used their credit cards for convenience. Even though they built balances through debit activity each month, they chose to pay their balances in full and avoid finance charges. These are the "transactors" or convenience users. The second type of analysis needed to use 'months on books', regardless of the source campaign. This analysis required computing the account age from both the date the account was opened and the "load date" of the transaction data. However, if the data mining task is also to understand this behaviour in the context of the campaign vintage mentioned earlier, there is another consideration. Prospects for the "May 99" campaign were solicited in May of 1999. However, many new customers did not use their card until June or July of 1999. There were three main reasons:
• some wanted to compare their offer to other offers
• processing is slower during May and June
• some waited until a specific event (e.g. purchase of a large present at the Christmas holidays) to use their card for the first time

At this point, the data warehouse probably needs to store at least the following date information:
a. date of the campaign
b. date the account first opened
c. date of the first transaction
d. load date for each month of data

The difference between either "b" or "c" above and "d" can be used as the measure to index account age or months on books.

No single date field is more important than another, but multiple date fields are probably necessary if vintage as well as customer life-cycle analyses are both to be performed.
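A small sketch of deriving "months on books" from the stored dates follows, assuming the account-open date and the monthly load date are available as columns; the file and column names are hypothetical.

import pandas as pd

perf = pd.read_csv("monthly_performance.csv")   # hypothetical: acct_id, open_date, load_date, balance
perf["open_date"] = pd.to_datetime(perf["open_date"])
perf["load_date"] = pd.to_datetime(perf["load_date"])

# Months on books: calendar months between account opening and the load month.
perf["months_on_books"] = ((perf["load_date"].dt.year - perf["open_date"].dt.year) * 12
                           + (perf["load_date"].dt.month - perf["open_date"].dt.month))

# Life-cycle view: average balance by months on books, regardless of campaign vintage.
lifecycle = perf.groupby("months_on_books")["balance"].mean()
print(lifecycle.head(12))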

Developing the Model
To develop a Lifetime Value model, we need to extract information from the Customer Information Table for risk indices, as well as from the Offer History Table for demographic and previous offer information.

Customer information table
A Customer Information Table is typically designed with one record per customer. The customer table contains the identifying information that can be linked to other tables, such as a transaction table, to obtain a current snapshot of a customer's performance. The following list details the key elements of the Customer Information Table:

Customer ID – a unique numeric or alpha-numeric code that identifies the customer throughout his entire lifecycle. This element is especially critical in the credit card industry, where the credit card number may change in the event of a lost or stolen card. But it is essential in any table to effectively link and track the behaviour of, and actions taken on, an individual customer.

Household ID – a unique numeric or alpha-numeric code that identifies the household of the customer through his or her entire lifecycle. This identifier is useful in some industries where products or services are shared by more than one member of a household.

Account number – a unique numeric or alpha-numeric code that relates to a particular product or service. One customer can have several account numbers.

Customer name – the name of a person or a business. It is usually broken down into multiple fields: last name, first name, middle name or initial, salutation.

Address – the street address is typically broken into components such as number, street, suite or apartment number, city, state, zip+4. Some customer tables have a line for a P.O. Box. With population mobility about 10% per year, additional fields that contain former addresses are useful for tracking and matching customers to other files.

Phone number – current and former numbers for home and work.

Demographics – characteristics such as gender, age, income, etc. may be stored for profiling and modelling.

Products or services – the list of products and product identification numbers varies by company. An insurance company may list all the policies along with policy numbers. A bank may list all the products across different divisions of the bank including checking, savings, credit cards, investments, loans, and more. If the number of products and product detail is extensive, this information may be stored in a separate table with a customer and household identifier.

Offer detail – the date, type of offer, creative, source code, pricing, distribution channel (mail, telemarketing, sales rep, e-mail), and any other details of an offer. Most companies look for opportunities to cross-sell or up-sell their current customers. There could be numerous “offer detail” fields in a customer record, each representing an offer for an additional product or service.

Model Scores – response, risk, attrition, profitability scores and/or any other scores that are created or purchased.

Transaction table
The Transaction Table contains records of customer activity. It is the richest and most predictive information but can be the most difficult to access. Each record represents a single transaction, so there are multiple records for each customer. In order to use this data for modelling, it must be summarized and aggregated to a customer level. The following lists the key elements of the Transaction Table:

Customer ID – defined above.
Household ID – defined above.

Transaction Type – the type of credit card transaction, such as charge, return, or fee (annual, overlimit, late).
Transaction Date – the date of the transaction.
Transaction Amount – the dollar amount of the transaction.

Offer history table
The Offer History Table contains details about offers made to prospects, customers or both. The most useful format is a unique record for each customer or prospect. Variables created from this table are often the most predictive in response and activation targeting models. It seems logical that if you know someone has received your offer every month for 6 months, they are less likely to respond than someone who is seeing your offer for the first time. As competition intensifies, this type of information is becoming increasingly important. A Customer Offer History Table contains all cross-sell, up-sell and retention offers. A Prospect Offer History Table contains all acquisition offers as well as any predictive information from outside sources. It is also useful to store several addresses on the Prospect Offer History Table.

With an average amount of solicitation activity, this type of table can become very large. It is important to perform analysis to establish business rules that control the maintenance of this table. Fields like 'date of first offer' are usually correlated with response behaviour. The following list details some key elements in an Offer History Table:

Prospect ID/Customer ID – as in the Customer Information Table, this is a unique numeric or alphanumeric code that identifies the prospect for a specific length of time. This element is especially critical in the credit card industry, where the credit card number may change in the event of a lost or stolen card. But it is essential in any table to effectively track the behaviour of, and actions taken on, an individual customer.


Household ID – a unique numeric or alpha-numeric code that identifies the household of the customer through his entire lifecycle. This identifier is useful in some industries where products or services are shared by more than one member of a household.

Prospect name* – the name of a person or a business. It is usually broken down into multiple fields: last name, first name, middle name or initial, salutation.

Address* – the street address is typically broken into components such as number, street, suite or apartment number, city, state, zip+4. As in the Customer Table, some prospect tables have a line for a P.O. Box. Additional fields that contain former addresses are useful for matching prospects to outside files.

Phone number – current and former numbers for home and work.

Offer Detail – includes the date, type of offer, creative, source code, pricing, distribution channel (mail, telemarketing, sales rep, email) and any other details of the offer. There could be numerous groups of “offer detail” fields in a prospect or customer record, each representing an offer for an additional product or service.

Offer summary – date of first offer (for each offer type), best offer (unique to product or service), etc.

Model scores* – response, risk, attrition, profitability scores and/or any other scores that are created or purchased.

Predictive data* – includes any demographic, psychographic or behavioural data.

*These elements appear only on a Prospect Offer History Table. The Customer Table would support the Customer Offer History Table with additional data.

Defining the objective
The overall objective is to measure the Lifetime Value (LTV) of a customer over a 3-year period. If we can predict which prospects will be profitable, we can target our solicitations only to those prospects and reduce our mail expense. LTV consists of four major components:
• Activation – probability calculated by a model. The individual must respond, be approved by risk and incur a balance.
• Risk – the probability of charge-off, derived from a risk model score and converted to an index.
• Expected Account Profit – expected purchase, fee and balance behaviour over a 3-year period.
• Marketing Expense – cost of package, mailing and processing (approval, fulfilment).

The data collection
Names from three campaigns over the last 12 months were extracted from the Offer History Table. All predictive information was included in the extract: demographic and credit variables, risk scores and offer history. The expected balance behaviour was developed using segmentation analysis. An index of expected performance is displayed in a matrix of gender by marital status by age group (see Appendix A). The marketing expense, which includes the mail piece and postage, is $0.78.

To predict Lifetime Value, data was pulled from the Offer History Table from three campaigns with a total of 966,856 offers. To reduce the amount of data for analysis and maintain the most powerful information, a sample is created using all of the ‘Activation’ and 1/25th of the remaining records. This includes non-responders and non-activating responders. We define an ACTIVE as a customer with a balance at three months. The following code creates the sample dataset:

DATA A B;
   SET LIB.DATA;
   IF 3MON_BAL > 0 THEN OUTPUT A;   /* activated: balance at three months */
   ELSE OUTPUT B;                   /* all other accounts */
RUN;

DATA LIB.SAMPDATA;
   SET A B (WHERE=(RANUNI(5555) < .04));   /* keep all of A plus roughly 1 in 25 of B */
   SAMP_WGT = 25;                          /* sampling weight */
RUN;

This code puts all customers who activated into the sample dataset, along with a 1/25th random sample of the remaining accounts. It also creates a weight variable called SAMP_WGT with a value of 25.

The following table displays the sample characteristics:

                            Campaign    Sample    Weight
Non Resp/Non Active Resp    929075      37163     25
Responders/Active           37781       37781     1
Total                       966856      74944

The non-responders and non-activated responders are grouped together since our target is active responders. This gives us a manageable sample size of 74,944.

Model development
The first component of the LTV, the probability of activation, is based on a binary outcome, which is easily modelled using logistic regression. Logistic regression uses continuous predictors to estimate the odds of an event happening. The log of the odds is a linear function of the predictors. The equation is similar to the one used in linear regression, except that a log transformation is applied to the odds of the dependent variable. The equation is as follows:

log(p/(1-p)) = B0 + B1X1 + B2X2 + …… + BnXn
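For readers working outside SAS, a logistic fit of this form can be sketched as follows; frequency weights play the role of SAMP_WGT, and the predictor columns and file name are illustrative assumptions rather than the bank's actual variables.

import pandas as pd
import statsmodels.api as sm

samp = pd.read_csv("sampdata.csv")           # hypothetical export of LIB.SAMPDATA
y = samp["ACTIVATE"]                         # 1 if a balance exists at three months, else 0
X = sm.add_constant(samp[["TOT_BAL", "INCOME", "AGE_FILE"]])   # illustrative predictors

# freq_weights re-inflates the 1-in-25 sample of non-actives to population proportions.
model = sm.GLM(y, X, family=sm.families.Binomial(),
               freq_weights=samp["SAMP_WGT"]).fit()
print(model.summary())
print(model.predict(X)[:5])                  # estimated activation probabilities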

Variable preparation - Dependent
To define the dependent variable, create the variable ACTIVATE defined as follows:

IF 3MOBAL > 0 THEN ACTIVATE = 1; ELSE ACTIVATE = 0;

Variable preparation – previous offers
The bank has four product configurations for credit card offers. Each product represents a different intro rate and intro length combination. From our offer history table, we pull four variables for modelling that represent the number of times each product was mailed in the last 6 months: NPROD1, NPROD2, NPROD3, and NPROD4.

Through analysis, the following variables were determined to be the most predictive.

SAM_OFF1 – received the same offer one time in the past 6 months.
DIF_OFF1 – received a different offer one time in the past 6 months.
SAM_OFF2 – received the same offer more than one time in the past 6 months.
DIF_OFF2 – received a different offer more than one time in the past 6 months.


The product being modelled is Product 2. The following code creates the variables for modelling:

SAM_OFF1 = (NPROD2 = 1);                         /* same offer mailed exactly once */
SAM_OFF2 = (NPROD2 > 1);                         /* same offer mailed more than once */
DIF_OFF1 = (SUM(NPROD1, NPROD3, NPROD4) = 1);    /* a different offer mailed exactly once */
DIF_OFF2 = (SUM(NPROD1, NPROD3, NPROD4) > 1);    /* a different offer mailed more than once */

If the prospect has never received an offer, then the values for the four named variables will all be 0.

Preparing credit variables
Since logistic regression looks for a linear relationship between the independent variables and the log of the odds of the dependent variable, transformations can be used to make the independent variables more linear. Examples of transformations include the square, cube, square root, cube root, and the log. Some complex methods have been developed to determine the most suitable transformations. However, with increased computer speed, a simpler method is as follows: create a list of common/favourite transformations; create new variables using every transformation for each continuous variable; then perform a logistic regression using all forms of each continuous variable against the dependent variable. This allows the model to select which form or forms fit best. Occasionally, more than one transformation is significant. After each continuous variable has been processed through this method, select the one or two most significant forms for the final model. The following code demonstrates this technique for the variable Total Balance (TOT_BAL):

PROC LOGISTIC DATA=LIB.DATA DESCENDING;    /* model the probability of ACTIVATE=1 */
   WEIGHT SAMP_WGT;
   MODEL ACTIVATE = TOT_BAL TOT_B_SQ TOT_B_CU TOT_B_I TOT_B_LG
         / SELECTION=STEPWISE;
RUN;

The logistic model output (see Appendix D) shows two forms of TOT_BAL to be significant in combination: TOT_BAL and TOT_B_SQ. These forms will be introduced into the final model.
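For readers working outside SAS, the candidate transformations themselves can be created as in the sketch below; reading TOT_B_I and TOT_B_LG as the inverse and log forms is an assumption of this sketch, as are the file and column names.

import numpy as np
import pandas as pd

df = pd.read_csv("modeldata.csv")                         # hypothetical extract containing TOT_BAL
df["TOT_B_SQ"] = df["TOT_BAL"] ** 2                       # square
df["TOT_B_CU"] = df["TOT_BAL"] ** 3                       # cube
df["TOT_B_I"] = 1.0 / df["TOT_BAL"].replace(0, np.nan)    # inverse (guard against division by zero)
df["TOT_B_LG"] = np.log1p(df["TOT_BAL"].clip(lower=0))    # log (guard against non-positive balances)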

Partition data
The data are partitioned into two datasets, one for model development and one for validation. This is accomplished by randomly splitting the data in half using the following SAS® code:

DATA LIB.MODEL LIB.VALID;
   SET LIB.DATA;
   IF RANUNI(0) < .5 THEN OUTPUT LIB.MODEL;
   ELSE OUTPUT LIB.VALID;
RUN;

If the model performs well on the model data and not as well on the validation data, the model may be over-fitting the data. This happens when the model memorizes the data and fits itself to unique characteristics of that particular data. A good, robust model will score with comparable performance on both the model and validation datasets. As a result of the variable preparation, a set of 'candidate' variables has been selected for the final model. The next step is to choose the model options. The backward selection process is favoured by some modellers because it evaluates all of the variables in relation to the dependent variable while considering interactions among the independent or predictor variables. It begins by measuring the significance of all the variables and then removing them one at a time until only the significant variables remain.

The sample weight must be included in the model code to recreate the original population dynamics. If you eliminate the weight, the model will still produce correct rank-ordering, but the actual estimates for the probability of a 'paid sale' will be incorrect. Since our LTV model uses actual estimates, we will include the weights. The following code is used to build the final model.

PROC LOGISTIC DATA=LIB.MODEL DESCENDING;   /* model the probability of ACTIVATE=1 */
   WEIGHT SAMP_WGT;
   MODEL ACTIVATE = INQL6MO TOT_BAL TOT_B_SQ SAM_OFF1 DIF_OFF1 SAM_OFF2 DIF_OFF2
                    INCOME INC_LOG AGE_FILE NO30DAY TOTCRLIM POPDENS MAIL_ORD
         / SELECTION=BACKWARD;
RUN;

The resulting model has 7 predictors. The parameter estimate is multiplied by the value of the variable to create the final probability. The strength of the predictive power is distributed like a chi-square, so we look to that distribution for significance. The higher the chi-square, the lower the probability of the event occurring randomly (Pr > chi-square). The strongest predictor is the variable DIF_OFF2, which demonstrates the power of offer history on the behaviour of a prospect. Introducing offer history variables into the acquisition modelling process has been the single most significant improvement in the last three years. The following equation shows how the probability is calculated, once the parameter estimates have been calculated:

prob = exp(B0 + B1X1 + B2X2 + … + BnXn) / (1 + exp(B0 + B1X1 + B2X2 + … + BnXn))

This creates the final score, which can be evaluated using a gains table (see Appendix D). Sorting the dataset by the score and dividing it into 10 groups of equal volume creates the gains table. This is called a Decile Analysis. The validation dataset is also scored and evaluated in a gains table or Decile Analysis. Both of these tables show strong rank ordering. This can be seen by the gradual decrease in predicted and actual probability of ‘Activation’ from the top decile to the bottom decile. The validation data shows similar results, which indicates a robust model. To get a sense of the ‘lift’ created by the model, a gains chart is a powerful visual tool. The Y-axis represents the % of ‘Activation’ captured by each model. The X-axis represents the % of the total population mailed. Without the model, if you mail 50% of the file, you get 50% of the potential ‘Activation’. If you use the model and mail the same percentage, you capture over 97% of the ‘Activation’. This means that at 50% of the file, the model provides a ‘lift’ of 94% {(97-50)/50}.
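A minimal sketch of building such a decile analysis from scored records is shown below; the input columns (model score, actual outcome, sample weight) and the file name are assumptions of this sketch.

import pandas as pd

scored = pd.read_csv("scored_validation.csv")      # hypothetical: score, activate, samp_wgt

# Rank by model score and cut into 10 equal-volume groups (decile 1 = highest scores).
scored = scored.sort_values("score", ascending=False).reset_index(drop=True)
scored["decile"] = pd.qcut(scored.index, 10, labels=list(range(1, 11)))

gains = (scored.assign(w_active=scored["activate"] * scored["samp_wgt"])
         .groupby("decile", observed=True)
         .agg(mailed=("samp_wgt", "sum"),
              actives=("w_active", "sum"),
              avg_score=("score", "mean")))
gains["cum_pct_actives"] = gains["actives"].cumsum() / gains["actives"].sum() * 100
print(gains)   # rank ordering: actives and avg_score should fall from decile 1 to decile 10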

Financial assessment
To get the final LTV, we use the formula:
LTV = Pr(Paid Sale) * Risk Index Score * Expected Account Profit - Marketing Expense

At this point, we apply the risk matrix score and expected account profit value. The financial assessment shows the model's ability to select the most profitable customers. Notice how the risk score index is lower for the most responsive customers. This is common in direct response and demonstrates 'adverse selection'. In other words, the riskier prospects are often the most responsive. At some point in the process, a decision is made to mail a percentage of the file. In this case, you could consider the fact that in decile 7 the LTV becomes negative, and limit your selection to deciles 1 through 6. Another decision criterion could be that you need to be above a certain 'hurdle rate' to cover fixed expenses. In this case, you might look at the cumulative LTV to be above a certain amount, such as $30. Decisions are often made considering a combination of criteria.
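As a small worked illustration of the LTV formula and the decile cut-off decision, the sketch below uses entirely hypothetical decile-level averages; only the $0.78 marketing expense comes from the case study.

# LTV = Pr(activation) * risk index * expected account profit - marketing expense
MARKETING_EXPENSE = 0.78    # mail piece plus postage, per the case study

def lifetime_value(p_activate, risk_index, expected_profit):
    """Per-name lifetime value under the case study's formula."""
    return p_activate * risk_index * expected_profit - MARKETING_EXPENSE

# Hypothetical decile-level averages: mail deciles while LTV stays above the hurdle.
deciles = [(1, 0.065, 0.92, 120.0), (2, 0.041, 0.95, 110.0), (7, 0.004, 1.03, 95.0)]
for decile, p, risk, profit in deciles:
    print(decile, round(lifetime_value(p, risk, profit), 2))   # decile 7 comes out negative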


The final evaluation of your efforts may be measured in a couple of ways. One goal could be to mail fewer pieces and capture the same LTV. If we mail the entire file with random selection, we would capture $13,915,946 in LTV, with a mail cost of $754,155. By mailing 5 deciles using the model, we would capture $14,042,255 in LTV with a mail cost of only $377,074. In other words, with the model we could capture slightly more LTV and cut our marketing cost in half. Or, we can compare similar mail volumes and increase LTV. With random selection at 50% of the file, we would capture $6,957,973 in LTV. Modelled, the LTV would climb to $14,042,255. This is a lift of over 100% ((14042255 - 6957973)/6957973 = 1.018).

Conclusion
Successful data mining and predictive modelling depend on quality data that is easily accessible. A well-constructed data warehouse allows for the integration of Offer History, which is an excellent predictor of Lifetime Value.

(Source: Rud, C. O., Data Warehousing for Data Mining: A Case Study [PDF] Available at: . [Accessed 30 September 2011].)

Questions
1. How many datasets is the data partitioned into?
2. Which is the first challenge mentioned in the above case study?
3. What are the analysis capabilities?

Bibliography

References
• Adriaans, P., 1996. Data Mining, Pearson Education India.
• Alexander, D., Data Mining [Online] Available at: . [Accessed 9 September 2011].
• Berlingerio, M., 2009. Temporal mining for interactive workflow data analysis [Video Online] Available at: . [Accessed 12 September].
• Dr. Krie, H. P., Spatial Data Mining [Online] Available at: . [Accessed 9 September 2011].
• Dr. Kuonen, D., 2009. Data Mining Applications in Pharma/BioPharma Product Development [Video Online] Available at: . [Accessed 12 September 2011].
• Galeas, Web mining [Online PDF] Available at: . [Accessed 12 September 2011].
• Hadley, L., 2002. Developing a Data Warehouse Architecture [Online] Available at: . [Accessed 8 September 2011].
• Han, J. and Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd ed., Diane Cerra.
• Han, J., Kamber, M. and Pei, J., 2011. Data Mining: Concepts and Techniques, 3rd ed., Elsevier.
• http://nptel.iitm.ac.in, 2008. Lecture - 34 Data Mining and Knowledge Discovery [Video Online] Available at: <http://www.youtube.com/watch?v=m5c27rQtD2E>. [Accessed 12 September 2011].
• http://nptel.iitm.ac.in, 2008. Lecture - 35 Data Mining and Knowledge Discovery Part II [Video Online] Available at: . [Accessed 12 September 2011].
• Humphries, M., Hawkins, M. W. and Dy, M. C., 1999. Data warehousing: architecture and implementation, Prentice Hall Professional.
• Intricity101, 2011. What is OLAP? [Video Online] Available at: . [Accessed 12 September 2011].
• Kimball, R., 2006. The Data warehouse Lifecycle Toolkit, Wiley-India.
• Kumar, A., 2008. Data Warehouse Layered Architecture 1 [Video Online] Available at: . [Accessed 11 September 2011].
• Larose, D. T., 2006. Data mining methods and models, John Wiley and Sons.
• Learndatavault, 2009. Business Data Warehouse (BDW) [Video Online] Available at: . [Accessed 12 September 2011].
• Lin, W., Orgun, M. A. and Williams, G. J., An Overview of Temporal Data Mining [Online PDF] Available at: . [Accessed 9 September 2011].
• Liu, B., 2007. Web data mining: exploring hyperlinks, contents, and usage data, Springer.
• Mailvaganam, H., 2007. Data Warehouse Project Management [Online] Available at: . [Accessed 8 September 2011].
• Maimom, O. and Rokach, L., 2005. Data mining and knowledge discovery handbook, Springer Science and Business.
• Maimom, O. and Rokach, L., Introduction To Knowledge Discovery In Database [Online PDF] Available at: . [Accessed 9 September 2011].
• Mento, B. and Rapple, B., 2003. Data Mining and Warehousing [Online] Available at: . [Accessed 9 September 2011].
• Mitsa, T., 2009. Temporal Data Mining, Chapman & Hall/CRC.
• OracleVideo, 2010. Data Warehousing Best Practices Star Schemas [Video Online] Available at: <http://www.youtube.com/watch?v=LfehTEyglrQ>. [Accessed 12 September 2011].


• Orli, R. and Santos, F., 1996. Data Extraction, Transformation, and Migration Tools [Online] Available at: . [Accessed 9 September 2011].
• Ponniah, P., 2001. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, Wiley-Interscience Publication.
• SalientMgmtCompany, 2011. Salient Visual Data Mining [Video Online] Available at: . [Accessed 12 September 2011].
• Seifert, J. W., 2004. Data Mining: An Overview [Online PDF] Available at: . [Accessed 9 September 2011].
• Springerlink, 2006. Data Mining System Products and Research Prototypes [Online PDF] Available at: . [Accessed 12 September 2011].
• SQLUSA, 2009. SQLUSA.com Data Warehouse and OLAP [Video Online] Available at: . [Accessed 12 September 2011].
• StatSoft, 2010. Data Mining, Cluster Techniques - Session 28 [Video Online] Available at: . [Accessed 12 September 2011].
• StatSoft, 2010. Data Mining, Model Deployment and Scoring - Session 30 [Video Online] Available at: . [Accessed 12 September 2011].
• Statsoft, Data Mining Techniques [Online] Available at: . [Accessed 12 September 2011].
• Swallacebithead, 2010. Using Data Mining Techniques to Improve Forecasting [Video Online] Available at: . [Accessed 12 September 2011].
• University of Magdeburg, 2007. 3D Spatial Data Mining on Document Sets [Video Online] Available at: . [Accessed 12 September 2011].
• Wan, D., 2007. Typical data warehouse deployment lifecycle [Online] Available at: . [Accessed 12 September 2011].
• Zaptron, 1999. Introduction to Knowledge-based Knowledge Discovery [Online] Available at: . [Accessed 9 September 2011].

Recommended Reading
• Chang, G., 2001. Mining the World Wide Web: an information search approach, Springer.
• Chattamvelli, R., 2011. Data Mining Algorithms, Alpha Science International Ltd.
• Han, J. and Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann.
• Jarke, M., 2003. Fundamentals of data warehouses, 2nd ed., Springer.
• Kantardzic, M., 2001. Data Mining: Concepts, Models, Methods, and Algorithms, 2nd ed., Wiley-IEEE.
• Khan, A., 2003. Data Warehousing 101: Concepts and Implementation, iUniverse.
• Liu, B., 2011. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2nd ed., Springer.
• Markov, Z. and Larose, D. T., 2007. Data mining the Web: uncovering patterns in Web content, structure, and usage, Wiley-Interscience.
• Parida, R., 2006. Principles & Implementation of Data Warehousing, Firewell Media.
• Ponniah, P., 2001. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, Wiley-Interscience Publication.
• Ponniah, P., 2010. Data Warehousing Fundamentals for IT Professionals, 2nd ed., John Wiley and Sons.
• Prabhu, C. S. R., 2004. Data warehousing: concepts, techniques, products and applications, 2nd ed., PHI Learning Pvt. Ltd.
• Pujari, A. K., 2001. Data mining techniques, 4th ed., Universities Press.
• Rainardi, V., 2007. Building a data warehouse with examples in SQL Server, Apress.
• Roddick, J. F. and Hornsby, K., 2001. Temporal, spatial, and spatio-temporal data mining, Springer.

• Scime, A., 2005. Web mining: applications and techniques, Idea Group Inc (IGI).
• Stein, A., Shi, W. and Bijker, W., 2008. Quality aspects in spatial data mining, CRC Press.
• Thuraisingham, B. M., 1999. Data mining: technologies, techniques, tools, and trends, CRC Press.
• Witten, I. H. and Frank, E., 2005. Data mining: practical machine learning tools and techniques, 2nd ed., Morgan Kaufmann.


Self Assessment Answers

Chapter I 1. a 2. d 3. c 4. d 5. a 6. d 7. c 8. b 9. a 10. d

Chapter II 1. b 2. a 3. d 4. a 5. b 6. b 7. c 8. c 9. a 10. a

Chapter III 1. a 2. c 3. b 4. d 5. a 6. d 7. b 8. b 9. c 10. a

Chapter IV 1. a 2. c 3. a 4. a 5. c 6. c 7. a 8. a 9. c 10. d 11. a

Chapter V 1. d 2. c 3. a 4. a 5. a 6. a 7. b 8. b 9. d 10. a

Chapter VI 1. c 2. a 3. b 4. d 5. c 6. a 7. b 8. b 9. b 10. a

Chapter VII 1. a 2. c 3. d 4. a 5. b 6. d 7. c 8. c 9. a 10. b
