Introduction to Data Mining

Dr Arulsivanathan Naidoo South Africa

OECD Conference Cape Town 8-10 December 2010

Your reference 1 Outline

• Introduction
• Statistics vs Knowledge Discovery
• Predictive Modeling
• Data Mining Examples
• Census 2011
• ROC
• Conclusions

Introduction

• What is Data Mining?

• Data Mining is a general term.

• Data mining is defined as the application of intelligent techniques such as decision trees, neural networks, fuzzy logic, genetic algorithms, the nearest neighbour method, rule induction and data visualization to large quantities of data to discover hidden trends, patterns and relationships (Lam and Kamber 2006).

Origins of Data Mining

Data Mining draws on:

• Statistics
• Pattern Recognition
• Neurocomputing
• Machine Learning
• Databases
• Artificial Intelligence
• KDD

Hypothesis Testing

Top-down approach:

Statement: Hypothesis → Analysis → Decision: Accept H0

Knowledge Discovery

Bottom-up approach:

Data → Question: What item is purchased with disposable baby napkins? → Answer: Beer → Statement

Unsupervised learning

Data

• Association: items bought together
• Disassociation: items not bought together
• Sequential: items bought in order
• Cluster / SOM: grouping into segments
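The idea of "items bought together" can be illustrated with a minimal sketch of two common association measures, support and confidence. The baskets below are made up for illustration, and the function names `support` and `confidence` are hypothetical helpers, not part of any tool mentioned in this talk.

```python
# Hypothetical shopping baskets (invented data, for illustration only).
baskets = [
    {"baby napkins", "beer", "milk"},
    {"baby napkins", "beer"},
    {"baby napkins", "bread"},
    {"beer", "crisps"},
    {"milk", "bread"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)

pair_support = support({"baby napkins", "beer"})
rule_confidence = confidence({"baby napkins"}, {"beer"})
print(pair_support, rule_confidence)
```

A rule such as "baby napkins → beer" would normally be reported only when both measures exceed thresholds chosen by the analyst.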

Supervised Learning

Data with a Target Variable

• Decision Tree
• Regression
• Neural Network
• Two Stage

What is a Model?

One word: Equation

Straight line: Y = mX + c

Example: Countryside
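As a rough illustration of the straight-line model Y = mX + c, the sketch below estimates m and c by ordinary least squares. Least squares is an assumption here (the slide does not say how the line is fitted), and the data points are invented.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of slope m and intercept c in Y = mX + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

# Invented points that lie exactly on Y = 2X + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, c = fit_line(xs, ys)
print(m, c)
```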

Decision Tree

A decision tree model is constructed by segmenting a dataset using a series of simple rules, resulting in a hierarchy of segments within segments

Algorithms such as CHAID (chi-squared automatic interaction detection) can be used to decide how to split the segments.

The hierarchy is called a tree and each segment a node

Decision Tree

Start: 100 men, 100 women

Split 1: Short Hair / Long Hair

Split 2: Earrings / No Earrings

Rule: predict that everyone with short hair and earrings is female
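The split selection described on the two decision tree slides can be sketched with a plain Pearson chi-squared statistic. This is a simplification: full CHAID also merges categories and compares adjusted p-values. The counts below are invented (the slides give only the totals of 100 men and 100 women), and `chi_squared` is a hypothetical helper.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Invented counts as [men, women] per segment; each column sums to 100.
candidate_splits = {
    "hair length": [[80, 30],   # short hair
                    [20, 70]],  # long hair
    "earrings":    [[10, 75],   # earrings
                    [90, 25]],  # no earrings
}
scores = {name: chi_squared(table) for name, table in candidate_splits.items()}
# Both tables are 2x2, so the larger statistic means the stronger association.
best = max(scores, key=scores.get)
print(scores, best)
```

With these invented counts the earrings split separates men and women more strongly, so it would be chosen first; real data could easily give the opposite order.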

Regression

Inputs x1, x2, x3 → Y

Neural Networks

Inputs x1, x2, x3 → Hidden units H1, H2 → Output Y

Black Box
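To make the "black box" a little less opaque, here is a minimal sketch of one forward pass through a network of the shape drawn above: three inputs, two hidden units and one output. The sigmoid activation and all weight values are assumptions chosen for illustration; the slide does not specify them.

```python
import math

def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Return the hidden-unit values [H1, H2] and the output Y for inputs x."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    y = sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)
    return hidden, y

# Arbitrary illustrative weights: one row of three weights per hidden unit.
hidden_weights = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_biases = [0.0, 0.1]
output_weights = [1.2, -0.7]
output_bias = 0.05

hidden, y = forward([1.0, 2.0, 3.0], hidden_weights, hidden_biases,
                    output_weights, output_bias)
print(hidden, y)
```

Training would adjust the weights to reduce prediction error; only the prediction step is shown here.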

Two Stage

Buy from every Catalogue R100

Buy from Catalogue once/year R5000

Eurostat Funding

KESO (Knowledge Extraction for Statistical Offices): a Eurostat project with the goal of constructing a versatile, efficient, industrial-strength data mining system that satisfies the needs of providers of large-scale databases.

SPIN (Spatial Mining for Data of Public Interest): developed to support statistical offices in the timely and cost-effective dissemination of statistical data by integrating state-of-the-art GIS and data mining functionality in an open, highly extensible, internet-enabled plug-in architecture.

IDSA (Intelligent Data Control System), Hassain et al 2010: an application of data mining to official statistics.

NASS

Decision Trees

• Census Non-Response Weighting
• Census Mail List Trimming
• Analysis of Reporting Errors
• Allocation of Survey Incentives
• Prediction of Survey Non-Respondents

NASS

Cluster Analysis

• 2007 Census Donor Pool Screening
• Design and Construction
• Identifying Subtypes of Records Missing from the Census Mail List

Association Analysis

• Survey Data Edit Design

Examples

• Absa Branch Robberies
• Old Mutual Policies
• MTN prepaid
• HSBC Bank Credit Cards
• Royal Saudi Air Force
• Census 2011

Census 2011

Data: Census 2001

Process: Sample () → Model A, Model B, Model C → Assess → Score → Results

Labels on the diagram: High Wall Areas, Informal Areas, Will Respond

Prediction Types

Training Data

Case 1: inputs, target
Case 2: inputs, target
Case 3: inputs, target
Case 4: inputs, target
Case 5: inputs, target

Predictions

• Decisions
• Rankings
• Estimates

Prediction Types

Training Data              Decisions
Case 1: inputs, target     Success
Case 2: inputs, target     Failure
Case 3: inputs, target     Failure
Case 4: inputs, target     Success
Case 5: inputs, target     Success

Prediction Types

Training Data              Rankings
Case 1: inputs, target     680
Case 2: inputs, target     720
Case 3: inputs, target     640
Case 4: inputs, target     582
Case 5: inputs, target     635

Prediction Types

Training Data              Estimates
Case 1: inputs, target     0.45
Case 2: inputs, target     0.53
Case 3: inputs, target     0.62
Case 4: inputs, target     0.55
Case 5: inputs, target     0.47

Prediction Type    Validation Fit Statistic         Direction
Decisions          Misclassification                Smallest
                   Average Profit/Loss              Largest/Smallest
                   Kolmogorov-Smirnov Statistic     Largest
Rankings           ROC Index (Concordance)          Largest
                   Gini Coefficient                 Largest
Estimates          Average Squared Error            Smallest
                   Schwarz's Bayesian Criterion     Smallest
                   Log-likelihood                   Largest
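Two of the fit statistics listed above are simple enough to sketch directly: misclassification (for decisions) and average squared error (for estimates). The estimates are the five values from the Estimates slide; the actual outcomes and the 0.5 cutoff are assumptions made up for this illustration.

```python
def misclassification_rate(actual, decisions):
    """Fraction of cases where the decision differs from the actual outcome."""
    wrong = sum(1 for a, d in zip(actual, decisions) if a != d)
    return wrong / len(actual)

def average_squared_error(actual, estimates):
    """Mean of squared differences between actual outcomes (0/1) and estimates."""
    return sum((a - p) ** 2 for a, p in zip(actual, estimates)) / len(actual)

estimates = [0.45, 0.53, 0.62, 0.55, 0.47]  # values from the Estimates slide
actual = [0, 1, 1, 0, 0]                    # invented outcomes, 1 = success

# Turn estimates into decisions with an assumed cutoff of 0.5.
decisions = [1 if p >= 0.5 else 0 for p in estimates]

misclassification = misclassification_rate(actual, decisions)
ase = average_squared_error(actual, estimates)
print(misclassification, ase)
```

Both statistics are "smallest is best", which matches the Direction column above.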

Confusion Matrix

                    Actual male           Actual female
Predicted male      a  True Positive      c  False Positive
Predicted female    b  False Negative     d  True Negative

ROC

• Sensitivity = a / (a + b)

• Specificity = d / (c + d)
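The two ratios can be checked with a small sketch that counts the cells a, b, c and d of the confusion matrix and then applies the formulas, treating "male" as the positive class as on the confusion matrix slide. The actual and predicted labels are invented, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(actual, predicted, positive="male"):
    """Return (a, b, c, d): true positives, false negatives, false positives, true negatives."""
    a = b = c = d = 0
    for act, pred in zip(actual, predicted):
        if act == positive and pred == positive:
            a += 1  # true positive
        elif act == positive:
            b += 1  # false negative: actual positive predicted as negative
        elif pred == positive:
            c += 1  # false positive: actual negative predicted as positive
        else:
            d += 1  # true negative
    return a, b, c, d

# Invented labels for eight people.
actual = ["male"] * 4 + ["female"] * 4
predicted = ["male", "male", "male", "female", "male", "female", "female", "female"]

a, b, c, d = confusion_counts(actual, predicted)
sensitivity = a / (a + b)
specificity = d / (c + d)
print(a, b, c, d, sensitivity, specificity)
```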

ROC

The ROC (Receiver Operating Characteristic) curve was first used during World War 2, following the attack on Pearl Harbor in 1941. The US Army began research into correctly detecting Japanese aircraft from their radar signals.

ROC Curve
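A ROC curve plots sensitivity against 1 - specificity as the cutoff on the model score is lowered. The sketch below builds the points of the curve, computes the area under it (the ROC index) with the trapezoid rule, and derives the Gini coefficient as 2 × area - 1. The scores and outcomes are invented, and the function names are hypothetical.

```python
def roc_points(actual, scores):
    """Return (1 - specificity, sensitivity) pairs, one per distinct cutoff.

    actual holds 1 for a positive case and 0 for a negative case.
    """
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= cutoff and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= cutoff and a == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Invented outcomes and model scores for six cases.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(actual, scores)
roc_index = area_under_curve(points)
gini = 2 * roc_index - 1
print(points, roc_index, gini)
```

A model that ranks at random gives a ROC index near 0.5 (Gini near 0), and a perfect ranking gives 1 for both.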

Conclusion

Data mining is a growing discipline which originated outside statistics, in the database community, mainly for commercial purposes. Today data mining can be considered a branch of exploratory statistics where useful models and patterns are uncovered through the extensive use of algorithms.

Finally, who should analyse huge data sets: the National Statistics Offices or other research institutions?

Data mining techniques use individual records. By law there is a confidentiality clause. The NSOs are best placed to do this, and this will imply new directions of research.

Conclusion

Official statistics should be a field for data mining, giving new life and value to its huge databases, but this may imply a redefinition of the visions and missions of official statistics offices.

South Africa changed its vision and mission this year.

In Statistics South Africa we have acquired data mining software and we have started a data mining user group of over 100 researchers. We are hoping to start a working paper series where some of this research will be published on our website for comments.

Thank you