Introduction to Data Mining
Dr Arulsivanathan Naidoo Statistics South Africa
OECD Conference Cape Town 8-10 December 2010
Your reference 1 Outline
• Introduction
• Statistics vs Knowledge Discovery
• Predictive Modeling
• Data Mining Examples
• Census 2011
• ROC
• Conclusions
Introduction
• What is Data Mining?
• Data Mining is a general term.
• Data mining is defined as the application of intelligent techniques such as decision trees, neural networks, fuzzy logic, genetic algorithms, the nearest neighbour method, rule induction and data visualization to large quantities of data to discover hidden trends, patterns and relationships (Lam and Kamber 2006).
Origins of Data Mining
Data Mining draws on:
• Statistics
• Pattern Recognition
• Neurocomputing
• Machine Learning
• Databases
• Artificial Intelligence
• KDD
Hypothesis Testing
Top-down: Statement → Hypothesis → Analysis → Decision (Accept H0)
Knowledge Discovery
Bottom-up: Data → Question → Answer → Statement
Question: What item is purchased with disposable baby napkins?
Answer: Beer
Unsupervised Learning
Data
• Association: items bought together
• Disassociation: items not bought together
• Sequential: items bought in order
• Cluster / SOM: grouping into segments
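The idea of association ("items bought together") can be illustrated with a minimal sketch that counts how often items co-occur in baskets. The baskets below are invented purely for illustration, and the helper names `pair_counts`, `support` and `confidence` are hypothetical, not taken from any particular data mining package.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"napkins", "beer", "milk"},
    {"napkins", "beer"},
    {"napkins", "bread"},
    {"beer", "crisps"},
    {"milk", "bread"},
]

def pair_counts(baskets):
    """Count how many baskets contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

print(pair_counts(baskets).most_common(1))
print(support({"napkins", "beer"}, baskets))
print(confidence({"napkins"}, {"beer"}, baskets))
```

Support measures how common a combination is overall, while confidence measures how often the second item appears given the first; real association tools typically filter rules on both.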
Your reference 7 Supervised Learning Target Data Variable
Decision Tree Regression Neural Network
Two Stage
Your reference 8 What is a Model? One Word Equation
Straight Line Y = mX + c
Example: Countryside
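As a sketch of what "a model is an equation" means in practice, the slope m and intercept c of the straight line Y = mX + c can be estimated from data by ordinary least squares. The function name `fit_line` and the sample points are assumptions made for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates of m and c in Y = mX + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of X, and cross-deviations of X and Y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

# Hypothetical points that lie exactly on Y = 2X + 1.
m, c = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m, c)
```

Once m and c are known, the model predicts Y for any new X by evaluating the equation.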
Your reference 9 Decision Tree
A decision tree model is constructed by segmenting a dataset using a series of simple rules, resulting in a hierarchy of segments within segments
Algorithms such as the CHAID (chi squared automatic interactive detection) can be used to decide on how to split the segments.
The hierarchy is called a tree and each segment a node
Your reference 10 Decision Tree
100 M 100 W
Short Hair Long Hair
Earings No Earings
Predict everyone with short hair and earings is female
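One way to see how a chi-squared based algorithm might choose between the hair and earrings splits is to compute the Pearson chi-squared statistic for each candidate and pick the larger. This is a simplified sketch, not the full CHAID procedure (which also merges categories and adjusts p-values). The counts are invented for illustration; only the totals of 100 men and 100 women come from the slide.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (rows of counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts; each row is one branch, columns are [men, women].
candidate_splits = {
    "hair": [[90, 20], [10, 80]],      # short hair, long hair
    "earrings": [[15, 70], [85, 30]],  # earrings, no earrings
}

scores = {name: chi_squared(table) for name, table in candidate_splits.items()}
best_split = max(scores, key=scores.get)
print(scores, best_split)
```

With these made-up counts the hair split separates men from women more strongly, so it would be used first and the earrings split would be considered within each resulting segment.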
Your reference 11 Regression
x1
x2 Y
x3
Your reference 12 Neural Networks
H1 x1
x2 H2 Y
x3
Inputs Outputs
Black Box
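The network on this slide (three inputs, two hidden units, one output) can be sketched as a single forward pass. The sigmoid activation and all weight values below are assumptions chosen for illustration; the slide does not specify them.

```python
import math

def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Compute hidden activations H1, H2 and the output Y for inputs x."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    y = sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)
    return hidden, y

# Hypothetical weights: one row of three weights per hidden unit.
hidden_weights = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_biases = [0.0, 0.1]
output_weights = [1.2, -0.7]
output_bias = 0.05

hidden, y = forward([1.0, 2.0, 3.0], hidden_weights, hidden_biases,
                    output_weights, output_bias)
print(hidden, y)
```

The "black box" label reflects that, unlike a decision tree, the fitted weights do not translate into simple readable rules.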
Your reference 13 Two Stage
Buy from every Catalogue R100
Buy from Catalogue once/year R5000
Your reference 14 Eurostat Funding
KESO ( Knowledge extraction for statistical offices) This is a Eurostat project with the goal to construct a versatile efficient industrial strength data mining system that satisfies the needs of providers large scale databases
SPIN (Spatial mining for data of public interest) was developed to support statistical offices in their timely and cost effective dissemination of statistical data by integrating the state of the art GIS and data mining functionality in an open highly extensible internet enabled plug in architecture
IDSA (Intelligent Data Control System) Hassain et al 2010 This is an application of data mining to the official statistics
Your reference 15 NASS
Decision Trees Census Non Response Weighting
Census Mail List Trimming
Analysis of reporting Errors
Allocation of Survey Incentives
Prediction of Survey Non Respondents
Your reference 16 NASS
Cluster Analysis
•2007 Census Donor Pool Screening •Questionnaire design and Construction •Identifying Subtypes of records Missing from the Census Mail List
Association Analysis •Survey Data Edit design
Your reference 17 Examples
• Absa Branch Robberies • Old Mutual Policies • MTN prepaid • HSBC Bank Credit Cards • Royal Saudi Air Force • Census 2011
Your reference 18 Census 2011
Model A Results Sample (Ranking) Model B Assess Score Data
Model C
Census 2001 High Will Informal Wall Respond Areas Areas
Your reference 19 Prediction Types
Training Data Predictions
Case 1 : inputs target Decisions Case 2 : inputs target Case 3 : inputs target Rankings Case 4 : inputs target
Case 5 : inputs target Estimates
Your reference 20 Prediction Types
Training Data Decisions
Case 1 : inputs target Success Case 2 : inputs target Failure Case 3 : inputs target Failure Case 4 : inputs target Success Case 5 : inputs target Success
Your reference 21 Prediction Types
Training Data Rankings
Case 1 : inputs target 680 Case 2 : inputs target 720 Case 3 : inputs target 640 Case 4 : inputs target 582 Case 5 : inputs target 635
Your reference 22 Prediction Types
Training Data Estimates
Case 1 : inputs target 0.45 Case 2 : inputs target 0.53 Case 3 : inputs target 0.62 Case 4 : inputs target 0.55 Case 5 : inputs target 0.47
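The three prediction types are related: an estimate can be turned into a ranking by sorting, or into a decision by applying a cutoff. The sketch below uses the five estimates from this slide; the 0.5 threshold and the function names are assumptions for illustration. The decisions and rankings on the earlier slides are separate examples and are not derived from these estimates.

```python
# Estimates for the five cases as listed on the slide.
estimates = {
    "Case 1": 0.45,
    "Case 2": 0.53,
    "Case 3": 0.62,
    "Case 4": 0.55,
    "Case 5": 0.47,
}

def to_ranking(estimates):
    """Order cases from highest to lowest estimate."""
    return sorted(estimates, key=estimates.get, reverse=True)

def to_decisions(estimates, threshold=0.5):
    """Label each case by comparing its estimate to an assumed threshold."""
    return {
        case: "Success" if p >= threshold else "Failure"
        for case, p in estimates.items()
    }

print(to_ranking(estimates))
print(to_decisions(estimates))
```

A ranking only needs the order of the cases to be right, whereas an estimate needs the values themselves to be accurate, which is why each type has its own fit statistics.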
Your reference 23 Prediction Validation Fit Statistic Direction Type
Decisions Misclassification Smallest Average Profit/Loss Largest/Smallest Kolmogorov-Smirnov Largest Statistic Rankings ROC Index (Concordance) Largest Gini Coefficient Largest
Estimates Average Square Error Smallest Schwarz’s Bayesian Smallest Criterion Largest Log-likelihood
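Two of the fit statistics in this table, misclassification (for decisions) and average squared error (for estimates), are simple enough to sketch directly. The targets below are hypothetical, the 0.5 threshold is an assumption, and the function names are chosen for this illustration.

```python
def misclassification_rate(targets, estimates, threshold=0.5):
    """Share of cases whose thresholded estimate disagrees with the target."""
    wrong = sum(
        1 for t, p in zip(targets, estimates)
        if (1 if p >= threshold else 0) != t
    )
    return wrong / len(targets)

def average_squared_error(targets, estimates):
    """Mean of squared differences between target (0 or 1) and estimate."""
    return sum((t - p) ** 2 for t, p in zip(targets, estimates)) / len(targets)

# Hypothetical binary targets paired with example estimates.
targets = [1, 0, 1, 1, 0]
estimates = [0.45, 0.53, 0.62, 0.55, 0.47]

print(misclassification_rate(targets, estimates))
print(average_squared_error(targets, estimates))
```

Both statistics are "smallest is best", matching the Direction column, and in practice they would be computed on validation data rather than on the training data.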
Your reference 24 Confusion Matrix
Actual male female
True a False c male Positive Positive Predicted False b True d female Negative Negative
Your reference 25 ROC
a • Sensitivity = ------
a+b
d • Specificity = ------
c+d
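Using the cell labels of the confusion matrix (a = true positives, b = false negatives, c = false positives, d = true negatives), sensitivity and specificity can be computed as in the sketch below. The counts are invented for illustration.

```python
def sensitivity(a, b):
    """True positive rate: a / (a + b)."""
    return a / (a + b)

def specificity(c, d):
    """True negative rate: d / (c + d)."""
    return d / (c + d)

# Hypothetical counts: 100 actual males and 100 actual females.
a, b, c, d = 80, 20, 10, 90

print(sensitivity(a, b))
print(specificity(c, d))
```

Sensitivity uses only the actual-positive column (a and b), and specificity only the actual-negative column (c and d), so neither depends on how many positives and negatives the data contain.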
Your reference 26 ROC
The ROC (Receiver Operating Characteristic) curve was first used during World War 2 following the attacks on Pearl harbour in 1941. The US army research the prediction of correctly detecting Japanese aircraft from their radar signals
Your reference 27 ROC Curve
Your reference 28 Conclusion
Data mining is a growing discipline which originated outside statistics in the data base community mainly for commercial purposes Today data Mining can be considered a branch of exploratory statistics where useful models and patterns are uncovered through the extensive use of algorithms
Finally who should analyse huge data sets, the National statistics Offices or other research institutions
Data mining techniques use individual records not aggregate data There is by law the confidentiality clause The NSO are the best place and this will imply new directions of research
Your reference 29 Conclusion
Official statistics should be a field for data mining giving new life and value to its huge data bases, but this may imply a redefinition of the visions and missions of official statistics offices
South Africa changed its vision and mission this year
In Statistics South Africa we have acquired data mining software and we have started a data mining user group of over 100 researchers We are hoping to start a working paper series where some of this research will be published on our website for comments
Your reference 30 Thank you
Your reference 31