
An Introduction to Data Mining
Dr I Kolyshkina, Dr G Brookes

Agenda

• Introduction
• Examples
• Techniques
  – Decision trees, MARS, hybrids
  – Model evaluation

What is Data Mining?

• Data mining is the process of exploration and analysis, by automatic means, of large quantities of data in order to discover meaningful patterns and rules (Michael Berry and Gordon Linoff, "Mastering Data Mining", 2000)

What is Data Mining?

• Data mining methods
  – impose no assumptions or structure on the data, and therefore allow the data to "teach" us
  – do not replace traditional statistical methods but complement them

What is Data Mining?

• Why now?
  – Rapid increase in computer power in the last decade
  – Increase in the availability of large data sets, and recognition of their untapped value
  – Traditional methods are often not powerful enough to provide such analysis
  – Availability of computer-intensive analytical methodologies (data mining)

Data Mining Techniques

• Data miners will typically use methods such as:
  – Tree-based methods
    • CART
    • Multivariate Adaptive Regression Splines (MARS)
    • TreeNet (MART)
  – Neural networks
  – Clustering

Data Mining Techniques

• The best data mining methods share these features:
  – Automatically select which predictors to use in models
  – Are capable of dealing with flawed, noisy and incomplete data
  – Provide clear presentation of results and useful feedback to analysts
  – Provide a robust modelling structure
  – Use learning, test and validation data

Example – Claim Streaming

• Took recent claims from WorkCover
• Used all information available at first report
• Calculated the response variable
• Analysed using CART
• Trained on approx. 150,000 claims, tested on 50,000
• Also added contextual information
• Combined the information to "score" each claim

Example – Customer Value

• AIM: to build a predictive customer value model for health insurance
  – That is, to place each customer on the following curve

[Chart: customer value ($), ranging from -$20,000 to $5,000, across profitable, less profitable and unprofitable customers, with corresponding strategies: retention/acquisition, structural changes, and treatment/prevention]

Example – Customer Value

[Diagram: projection of profit cash flows, drawing on premium income, costs, reinsurance, lapses, family make-up and product choice]

Example – Customer Value

• The most important predictors for customer value were:
  – Age and gender
  – Current and previous product
  – Previous utilisation
  – Duration and state
  – Payment channel and distribution channel
  – Family status
  – Socio-economic status

Example – Customer Value

• The model treats, in a structured and cohesive way, the different effects of
  – Health utilisation
  – Customer behaviour
  – Socio-economic and demographic characteristics

Example – Customer Value

• Crosses over "product silos"
  – Hospital utilisation impacts ancillary claim patterns
  – Ancillary claim patterns impact hospital utilisation
• Allows for future contingencies
  – Costs on a per-member basis
  – Income on a per-membership basis
  – Projected family make-up, including marriage, child birth and divorce

Techniques

CART Decision Trees

• Divide the data into exactly two subgroups (or "nodes")
• Split based on questions that have a "yes" or "no" answer
  – "Is age greater than 20?"
• Compare all the splits
• Select the split with the highest degree of homogeneity

How is a CART Decision Tree Built?

All observations: 40,000 pos. resp. (29%), 100,000 neg. resp. (71%)

Split on AGE < 20:
  – Age < 20: 30,000 pos. resp. (60%), 20,000 neg. resp. (40%)
  – Age >= 20: 10,000 pos. resp. (11%), 80,000 neg. resp. (89%)
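The split comparison above can be sketched in code. This is an illustrative example only, not the presenters' implementation: it uses the Gini index as one common measure of node homogeneity, applied to the AGE < 20 example figures.

```python
# Illustrative sketch: CART-style comparison of candidate binary
# splits by node purity, using the Gini index as the homogeneity
# measure. Figures are the AGE < 20 example from the slide.

def gini(pos, neg):
    """Gini impurity of a node: 0 means perfectly homogeneous."""
    total = pos + neg
    if total == 0:
        return 0.0
    p = pos / total
    return 2 * p * (1 - p)

def split_impurity(left, right):
    """Weighted average impurity of the two child nodes of a split.
    left and right are (pos, neg) tuples."""
    n = sum(left) + sum(right)
    return (sum(left) / n) * gini(*left) + (sum(right) / n) * gini(*right)

# Root node: 40,000 positive / 100,000 negative responses (29% / 71%).
root = gini(40_000, 100_000)

# Candidate split "AGE < 20": left child (30,000 pos, 20,000 neg),
# right child (10,000 pos, 80,000 neg).
age_split = split_impurity((30_000, 20_000), (10_000, 80_000))

# The split is worth keeping if it reduces impurity; CART would
# compare this reduction against every other candidate split.
print(root - age_split > 0)   # the AGE < 20 split improves homogeneity
```

A real implementation would evaluate every candidate question on every predictor and keep the split with the largest impurity reduction, which is the exhaustive search the slides describe.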

• The process is then repeated for each of the resulting nodes until further splitting is impossible or is stopped
• The model is pruned by comparing learning and test data
• The resulting model is called a decision tree
• This selection requires a lot of computation, but computation is exactly what a computer is good at!

Decision Tree – High Level

[Diagram: high-level decision tree. The root (14% positive) splits on diagnostic code; subsequent nodes split on previous health history, long stay off work, age < 50, socio-economic code, previous hospitalisation, age < 28, English-speaking background, occupation and postcode. The 12 terminal nodes range from 2% to 65% positive and together cover 100% of the population]

Advantages of CART Decision Trees

• Easy to interpret; pictorially represented
• Automatically and quickly selects the important predictors
• Can handle missing values
• Can handle a large number of categories in a predictor
• Unaffected by outliers
• Computationally quick

Disadvantages of Decision Trees

• Lack of continuity in the predicted value
  – A small change in a predictor could lead to a large change in the predicted value of the response
• Model is "coarse-grained"
  – A tree with N terminal nodes can only predict N different values

Disadvantages of Decision Trees

• Not so good with continuous or multilevel response variables
  – Although good as part of a hybrid model
• Difficulty in modelling linear structure
• Some tree-building methods produce unstable trees

Combination of Decision Trees with Other Methods

• Can combine decision trees with other modelling methodologies
  – Use a decision tree for quick selection of the important predictors, then build a linear model such as a GLM
  – Build hybrid models
    • The output of a decision tree model is used as input for a continuous model such as a GLM or MARS (or vice versa)

MARS
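MARS models are built from paired piecewise linear "hinge" basis functions. A minimal sketch, assuming a single knot at AGE = 20 (the knot value and variable name are just for illustration):

```python
# Illustrative sketch: MARS-style "hinge" basis functions with a
# knot at AGE = 20. Not code from the presentation.

def hinge_pos(x, knot):
    """max(x - knot, 0): non-zero only above the knot."""
    return float(max(x - knot, 0.0))

def hinge_neg(x, knot):
    """max(knot - x, 0): non-zero only at or below the knot."""
    return float(max(knot - x, 0.0))

# AGE1 = (AGE - 20) if AGE > 20, else 0
# AGE2 = (20 - AGE) if AGE <= 20, else 0
age1 = lambda age: hinge_pos(age, 20)
age2 = lambda age: hinge_neg(age, 20)

# A MARS model is then an ordinary linear model in these basis
# functions, e.g. response = b0 + b1*AGE1 + b2*AGE2, so the fitted
# slope can change at the knot while the fit stays continuous.
print(age1(25), age2(25))   # 5.0 0.0
print(age1(15), age2(15))   # 0.0 5.0
```

MARS itself searches over variables and knot locations automatically; the sketch only shows the shape of the basis functions it combines.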

• A continuous model
• A generalisation of stepwise regression and a modification of the CART (decision tree) method
• The MARS model is based on piecewise linear "basis" functions, e.g.
  – AGE1 equals (AGE - 20) if AGE > 20, and 0 otherwise
  – AGE2 equals (20 - AGE) if AGE <= 20, and 0 otherwise

MARS versus CART

• CART is
  – faster
  – requires less
• MARS is
  – more affected by missing values and outliers

Evaluation

• Accuracy
  – By examining classification tables and actual-versus-expected charts
• Stability
  – By testing on "non-learning" data
  – CART does this automatically
• Ranking
  – By examining the gains chart

Gains Chart

[Chart: gains chart. X-axis: top x% of cases, from 0% to 90% in 15% steps; y-axis: % of actual events captured in the top x%, ranked by probability of positive response, from 0% to 100%. The model curve sits between the baseline and the theoretical best]
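The points of a gains chart like the one above can be computed directly from model scores. A minimal sketch, with made-up toy data (the scores and outcomes are purely illustrative):

```python
# Illustrative sketch: computing gains-chart points. Cases are ranked
# by predicted probability of a positive response (descending), then
# we record the share of all actual positives captured in the top x%.

def gains_points(scores, outcomes, steps=10):
    """Return, for each cutoff, the fraction of actual positives
    captured in the top (100*i/steps)% of ranked cases."""
    ranked = [y for _, y in sorted(zip(scores, outcomes),
                                   key=lambda t: -t[0])]
    total_pos = sum(ranked)
    points = []
    for i in range(1, steps + 1):
        cutoff = round(len(ranked) * i / steps)
        captured = sum(ranked[:cutoff])
        points.append(captured / total_pos)
    return points

# Toy data: 10 cases, 4 actual positives. A good model ranks the
# positives near the top, so the curve climbs above the 45-degree
# baseline (random ranking) toward the theoretical best.
scores   = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1,   1,   0,   1,   0,   1,   0,   0,   0,   0]
print(gains_points(scores, outcomes, steps=5))
```

The baseline is the diagonal (the top x% of a random ranking captures x% of events); the theoretical best captures all events as early as the positive rate allows.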