
Data Mining: Overview

What is Data Mining?

• Recently* coined term for the confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science, engineering and business.
• In a state of flux: many definitions, a lot of debate about what it is and what it is not. Terminology is not standard, e.g. bias, classification, prediction, feature = independent variable, target = dependent variable, case = exemplar = row.

* First International workshop on Knowledge Discovery and Data Mining was in 1995

Broad and Narrow Definitions

• The Broad Definition includes traditional statistical methods; the Narrow Definition emphasizes automated and heuristic methods
• Data mining, data dredging, fishing expeditions
• Knowledge Discovery in Databases (KDD)

My Favorite

• “Statistics at scale and speed” Darryl Pregibon

• My extension:
  – “... And simplicity”

Gartner Group

• “Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.”

Drivers

• From focus on product/service to focus on customer
• IT: From focus on up-to-date balances to focus on patterns in transactions - Data Warehouses - OLAP
• Dramatic drop in storage costs: Huge databases
  – e.g. 20 million transactions/day, 10 terabyte database; Blockbuster: 36 million households
• Automatic Data Capture of Transactions
  – e.g. Bar Codes, POS devices, Mouse clicks, Location data (GPS, cell phones)
• Personalized interactions, longitudinal data

Core Disciplines

• Statistics (adapted for 21st century data sizes and speed requirements). Examples:
  – Descriptive
  – Models (DMD): Regression, Cluster Analysis
• Machine Learning: e.g. Neural Nets
• Data Base Retrieval: e.g. Association Rules
• Parallel developments: e.g. Tree methods, k-Nearest Neighbors, OLAP-EDA

Process

1. Develop understanding of application, goals
2. Create dataset for study (often from Data Warehouse)
3. Data Cleaning and Preprocessing
4. Data Reduction and projection
5. Choose Data Mining task
6. Choose Data Mining algorithms
7. Use algorithms to perform task
8. Interpret and iterate through 1-7 if necessary
9. Deploy: integrate into operational systems.

SEMMA Methodology (SAS)

• Sample from datasets, Partition into Training, Validation and Test datasets
• Explore dataset statistically and graphically
• Modify: Transform variables, Impute missing values
• Model: fit models, e.g. regression, classification tree, neural net
• Assess: Compare models using Partition, Test datasets
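
The Sample and Modify steps can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the 60/20/20 split, the fixed seed, and the helper names `partition` and `impute_mean` are choices made here, not part of SEMMA or of any SAS product.

```python
import random

def partition(rows, train=0.6, valid=0.2, seed=0):
    """Shuffle rows and split them into training, validation and test lists.

    The 60/20/20 proportions are an assumption for illustration only.
    """
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Hypothetical usage: 100 row identifiers standing in for a sampled dataset.
train_set, valid_set, test_set = partition(range(100))
```

Models are fitted on the training list, compared on the validation list, and the test list is held back for a final assessment of the chosen model.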

Illustrative Applications

• Customer Relationship Management

• Finance

• E-commerce and Internet

Customer Relationship Management

• Target Marketing
• Attrition Prediction/Churn Analysis
• Fraud Detection
• Credit Scoring

Target marketing

• Business problem: Use list of prospects for direct mailing campaign
• Solution: Use Data Mining to identify the most promising respondents, combining demographic and geographic data with data on past purchase behavior
• Benefit: Better response rate, savings in campaign cost

Example: Fleet Financial Group

• Redesign of customer service infrastructure, including $38 million investment in data warehouse and marketing automation
• Used logistic regression to predict response to home-equity product for sample of 20,000 customer profiles from 15 million customer base
• Used CART to predict profitable customers and customers who would be unprofitable even if they respond
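
To make the logistic regression step concrete, here is a minimal sketch that fits a one-variable response model by gradient ascent and ranks prospects by predicted response probability. It is not Fleet's model: the single predictor, the toy data, the learning rate and the function names are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1*x) by batch gradient ascent
    on the average log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)   # observed minus predicted
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical toy data: x = a scaled past-purchase measure,
# y = 1 if the customer responded to an earlier mailing.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)

# Score new prospects and mail the highest-scoring ones first.
prospects = [0.8, 3.2, 2.1]
ranked = sorted(prospects, key=lambda x: sigmoid(b0 + b1 * x), reverse=True)
```

In a real campaign the model would use many demographic, geographic and purchase variables, and the mailing cutoff would be chosen from the validation data.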

Churn Analysis: Telcos

• Business Problem: Prevent loss of customers, avoid adding churn-prone customers
• Solution: Use neural nets, time series analysis to identify typical patterns of telephone usage of likely-to-defect and likely-to-churn customers
• Benefit: Retention of customers, more effective promotions

Example: France Telecom

• CHURN/Customer Profiling System implemented as part of major custom data warehouse solution
• Preventive CPS, based on customer characteristics and known cases of churning and non-churning customers, to identify significant characteristics for churn
• Early detection CPS based on usage pattern matching with known cases of churn customers.

Fraud Detection

• Business problem: Fraud increases costs or reduces revenue
• Solution: Use logistic regression, neural nets to identify characteristics of fraudulent cases, to prevent fraud in future or prosecute more vigorously
• Benefit: Increased profits by reducing undesirable customers

Example: Automobile Insurance Bureau of Massachusetts

• Past reports on claims adjustors scrutinized by experts to identify cases of fraud
• Several characteristics (over 60) of claimant, type of accident, type of injury/treatment coded into database
• Dimension Reduction methods used to obtain weighted variables. Multiple Regression Step-wise Subset selection methods used to identify characteristics strongly correlated with fraud

Risk Analysis

• Business problem: Reduce risk of loans to delinquent customers
• Solution: Use credit scoring models using discriminant analysis to create score functions that separate out risky customers
• Benefit: Decrease in cost of bad debts

Finance

• Business problem: Pricing of corporate bonds depends on several factors: risk profile of company, seniority of debt, dividends, prior history, etc.
• Solution Approach: Through DM, develop more accurate models for predicting prices.

E-commerce and Internet

• Collaborative Filtering
• From Clicks to Customers

Recommendation systems

• Business opportunity: Users rate items (.com, CDNOW.com, MovieFinder.com) on the web. How to use information from other users to infer ratings for a particular user?
• Solution: Use of a technique known as collaborative filtering
• Benefit: Increase revenues by cross selling, up selling
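
One simple form of collaborative filtering is user-based: estimate a user's rating of an unseen item as a similarity-weighted average of other users' ratings of that item. The sketch below uses cosine similarity over co-rated items. The users, items and ratings are invented for illustration, and this is only one of several possible variants, not the method of any particular site.

```python
import math

# Hypothetical ratings on a 1-5 scale; names and values are made up.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "C": 2, "D": 5},
    "cat": {"A": 1, "B": 2, "C": 5, "D": 1},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, their in ratings.items():
        if other != user and item in their:
            w = cosine(ratings[user], their)
            num += w * their[item]
            den += w
    return num / den if den else None

# ann has not rated item D; infer her rating from bob and cat.
estimate = predict("ann", "D")
```

Because ann's past ratings resemble bob's more than cat's, the estimate for item D is pulled toward bob's high rating.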

Clicks to Customers

• Business problem: 50% of Dell’s clients order their computer through the web. However, the retention rate is 0.5%, i.e. only 0.5% of visitors to Dell’s web page become customers.
• Solution Approach: Through the sequence of their clicks, cluster customers and design website, interventions to maximize the number of customers who eventually buy.
• Benefit: Increase revenues

Emerging Major Data Mining applications

• Spam
• Genomics
• Medical History Data – Insurance Claims
• Personalization of services in e-commerce
• RF Tags: Gillette
• Security:
  – Container Shipments
  – Network Intrusion Detection

Core Concepts

• Types of Data:
  – Numeric
    • Continuous – ratio and interval
    • Discrete
    • Need for Binning
  – Categorical – ordered and unordered
  – Binary
• Generalization
• Regularization: Penalty for model complexity
• Distance
• Curse of Dimensionality
• Random and stratified sampling, resampling
• Loss Functions
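
Three of these concepts are easy to show in code: binning a numeric variable, measuring distance between cases, and adding a complexity penalty to a loss function. The sketch below is illustrative only. Equal-width bins, Euclidean distance and a ridge-style (sum of squared coefficients) penalty are specific choices assumed here; the slide does not prescribe them, and the function names are hypothetical.

```python
import math

def equal_width_bins(values, k):
    """Map each numeric value to a bin index 0..k-1 using k equal-width
    intervals between the minimum and maximum (assumes max > min)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def euclidean(a, b):
    """Euclidean distance between two cases given as equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def penalized_loss(residuals, coefs, lam):
    """Squared-error loss plus a penalty lam * sum(coef^2) on model complexity."""
    return sum(r ** 2 for r in residuals) + lam * sum(c ** 2 for c in coefs)

# Hypothetical usage with made-up ages and three bins.
ages = [22, 35, 47, 58, 63]
age_bins = equal_width_bins(ages, 3)
```

A larger `lam` penalizes large coefficients more heavily, trading some fit on the training data for a simpler model.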

Typical characteristics of mining data

• “Standard” format is:
  – Row = observation unit, Column = variable
• Many rows, many columns
• Many rows, moderate number of columns (e.g. tel. calls)
• Many columns, moderate number of rows (e.g. genomics)
• Opportunistic (often by-product of transactions)
  – Not from designed experiments
  – Often has missing data

Course Topics

• Supervised Techniques
  – Classification:
    • k-Nearest Neighbors, Naïve Bayes, Classification Trees
    • Discriminant Analysis, Logistic Regression, Neural Nets
  – Prediction (Estimation):
    • Regression, Regression Trees, k-Nearest Neighbors
• Unsupervised Techniques
  – Cluster Analysis, Principal Components
  – Association Rules, Collaborative Filtering
