MKTG-5963 Data Mining and CRM Applications
Group 3 PAKDD Competition
April 15, 2007
Dana Gray, Jesse Montgomery, Esther Thais, Steven Wheeler
Table of Contents
CRISP-DM: The Process
  Phase 1: Business Understanding
  Phase 2: Data Understanding
  Phase 3: Data Preparation
  Phase 4: Modeling
  Phase 5: Evaluation
  Phase 6: Deployment
Appendix
CRISP-DM: The Process
Phase 1: Business Understanding
The PAKDD competition is based on a cross-selling problem: how to identify the credit-card holders who are most likely to be viable targets for cross-selling home loans.
Phase 2: Data Understanding
The competition data was supplied in three MS-Excel files: Modeling, Prediction, and a Data Dictionary. SAS 9.1 was employed to import the provided Excel files and format them as SAS tables. Both data-sets were then imported into SAS Enterprise Miner (SASEM).
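A minimal sketch of the import step follows; the library name and file paths are assumptions for illustration, and DBMS=EXCEL requires SAS/ACCESS to PC Files.

   libname pakdd 'C:\pakdd';                      /* hypothetical project library */

   /* Import the two competition workbooks into SAS tables (SAS 9.1). */
   proc import datafile='C:\pakdd\modeling.xls'   /* hypothetical path */
               out=pakdd.modeling dbms=excel replace;
      getnames=yes;
   run;

   proc import datafile='C:\pakdd\prediction.xls' /* hypothetical path */
               out=pakdd.prediction dbms=excel replace;
      getnames=yes;
   run;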
The initial Create Data Source task classified all b_* variables as Segment and most of the rest as nominal or interval. The variables were reviewed and reclassified as needed; full results are available in the appendix. Several variables were rejected due to the number of missing values or a lack of variance (98%-99% a single value); a sketch of the kind of audit involved appears after the variable tables below.
Class (binary & nominal) variables

Variable              Role    Numcat  NMiss  Mode          ModePct  Mode2         Mode2Pct
AMEX_CARD             INPUT        3   3714  N               89.99                    9.13
ANNUAL_INCOME_RANGE   INPUT        8      0  150K -< 240K    45.66  240K -< 360K     27.38
A_DISTRICT_APPLICANT  INPUT        9      0  2               36.46  4                23.44
CHQ_ACCT_IND          INPUT        3      1  N               73.94  Y                26.06
CREDIT_CARD_TYPE      INPUT        2      0  C               85.02  B                14.98
CUSTOMER_SEGMENT      INPUT       13   1740  4               15.82  6                15.6
DINERS_CARD           INPUT        3   3885  N               90.31                    9.55
DVR_LIC               INPUT        2      0  1               98.08  0                 1.92
MARITAL_STATUS        INPUT        7      0  C               49.5   E                23.31
MASTERCARD            INPUT        4   2951  N               88.82                    7.25
NBR_OF_DEPENDANTS     INPUT       24      0  0               85.02  15                6.74
OCCN_CODE             INPUT        6      0  M               39.73  R                26.53
RENT_BUY_CODE         INPUT        3   3710  N               89.9                     9.12
RETAIL_CARDS          INPUT        4      2  Y               89.37  N                10.62
SAV_ACCT_IND          INPUT        3   2484  N               88.12                    6.1
VISA_CARD             INPUT        2      0  0               98.28  1                 1.72
TARGET_FLAG           TARGET       2      0  0               98.28  1                 1.72

Where the Mode2 column is blank, the second most common level is the missing value (the NMiss counts match the Mode2 percentages).
Note that DVR_LIC (driver’s license) and TARGET_FLAG are set to binary; all others are nominal.
Interval variables

Variable                Role   Mean     StdDev   Non-Missing  Missing  Min    Median  Max
AGE_AT_APPLICATION      INPUT   38.548    11.34        40700        0     18      38      84
A_TOTAL_AMT_DELQ        INPUT    1.8      33.8         40700        0      0       0    2322
A_TOTAL_BALANCES        INPUT  754.539  4320.22        40700        0  -5994       0  123456
A_TOTAL_NBR_ACCTS       INPUT    0.167     0.42        40700        0      0       0       4
B_ENQ_L12M_GR1          INPUT    1.478     9.7         40700        0      0       0      99
B_ENQ_L12M_GR2          INPUT    1.627     9.7         40700        0      0       0      99
B_ENQ_L12M_GR3          INPUT    1.516     9.71        40700        0      0       0      99
B_ENQ_L1M               INPUT    1.179     9.7         40700        0      0       0      99
B_ENQ_L3M               INPUT    1.529     9.7         40700        0      0       0      99
B_ENQ_L6M               INPUT    2.038     9.71        40700        0      0       1      99
B_ENQ_L6M_GR1           INPUT    1.244     9.7         40700        0      0       0      99
B_ENQ_L6M_GR2           INPUT    1.311     9.7         40700        0      0       0      99
B_ENQ_L6M_GR3           INPUT    1.247     9.7         40700        0      0       0      99
B_ENQ_LAST_WEEK         INPUT    1.039     9.7         40700        0      0       0      99
CURR_EMPL_MTHS          INPUT   75.128    84.15        40700        0      0      48    1000
CURR_RES_MTHS           INPUT   87.336    92.79        40700        0      0      51     723
NBR_OF_DEPENDANTS       INPUT    0.9       1.16        40700        0      0       0      15
PREV_EMPL_MTHS          INPUT    6.929    30.03        40700        0      0       0     480
PREV_RES_MTHS           INPUT   23.622    52.37        40700        0      0       1     624
TOTAL_NBR_CREDIT_CARDS  INPUT    0.133     0.46        40700        0      0       0      20
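The rejection screens (missing-value counts and near-constant variables) can be reproduced outside of SASEM; a quick audit sketch, assuming the imported pakdd.modeling table (StatExplore provides the same statistics inside SASEM):

   /* Missing-value counts and ranges for the interval variables. */
   proc means data=pakdd.modeling n nmiss min max;
   run;

   /* Level counts for the class variables, to flag near-constant inputs. */
   /* NLEVELS reports the number of distinct levels per variable.         */
   proc freq data=pakdd.modeling nlevels;
      tables _all_ / noprint;
   run;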
A series of Sample nodes was used to create a set of 10 balanced (700/700) sample data-sets. SAS code was then used to export the samples to the project library, and each sample was imported into SASEM via Create Data Source. Prior probabilities were then set for each sample data-set. These data-sets were used for all modeling efforts up until final model selection.
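A sketch of how one such balanced sample could be drawn in base SAS (we used Sample nodes; PROC SURVEYSELECT is an equivalent route, and the seed and data-set names are assumptions):

   /* Stratify on the target and take 700 events and 700 non-events. */
   proc sort data=pakdd.modeling out=work.modeling_sorted;
      by target_flag;
   run;

   proc surveyselect data=work.modeling_sorted out=pakdd.sample01
                     method=srs sampsize=(700 700) seed=20070415;
      strata target_flag;
   run;

Because a 50/50 sample heavily over-represents responders (the population response rate is only 1.72%), setting the prior probabilities tells SASEM the true rates so that posterior scores and decision statistics are adjusted accordingly.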
Phase 3: Data Preparation
Taking our balanced data-sets as a starting point, we moved into clean-up and preparation. Filter, Transform, Replace, and Impute nodes were all employed. Initial settings were established based on our study of the data, and multiple tests were performed to ascertain the effects of any changes. Full configuration parameters for the Filter and Transform nodes can be found in the appendix.
Summary of data preparation:
Filter: All non-rejected input b_* variables were clipped to remove 98 and 99 from the top end. These are company special values and do not denote customer information.
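A sketch of what the Filter node does here; recoding the special values to missing is one reasonable reading of the clip (the node can alternatively drop the affected rows):

   /* Remove the company special values 98 and 99 from the b_* counts. */
   data work.filtered;
      set pakdd.sample01;
      array benq{*} b_enq_l1m b_enq_l3m b_enq_l6m b_enq_last_week
                    b_enq_l6m_gr1 b_enq_l6m_gr2 b_enq_l6m_gr3
                    b_enq_l12m_gr1 b_enq_l12m_gr2 b_enq_l12m_gr3;
      do _i = 1 to dim(benq);
         if benq{_i} in (98, 99) then benq{_i} = .;
      end;
      drop _i;
   run;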
Transform: Due to strong skew and kurtosis in many of the variables, we tried a number of transforms (Max Normal, Max Correlation, Optimal, and Standardize), with model runs between each change so that we could evaluate the results. Max Normal and Optimal provided the greatest positive impact, with Max Normal winning out when applied to the original modeling data-set. Every transform was verified as adding to the efficacy of the production model.
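Max Normal evaluates a family of power and log transforms per variable and keeps the one that best normalizes the distribution; for heavily right-skewed counts such as the b_enq_* variables this often resolves to log(x + 1). A minimal sketch of one such outcome (the derived variable name is ours):

   /* One illustrative Max Normal outcome: log(x + 1) on a skewed count. */
   data work.transformed;
      set work.filtered;
      log_b_enq_l6m = log(b_enq_l6m + 1);
   run;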
Replacement: For the credit cards, unknown values are represented by 'X'.
Impute: Left at defaults; CUSTOMER_SEGMENT was the only variable processed.
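A combined sketch of the Replace and Impute steps (the card variable list is abbreviated, and CUSTOMER_SEGMENT is assumed to be a character variable whose mode, per the class table above, is '4'):

   /* Replace: recode unknown card indicators to 'X'.                     */
   /* Impute: the node default for class variables is the mode ('4' here, */
   /* hard-coded purely for illustration).                                */
   data work.replaced;
      set work.transformed;
      array cards{*} $ amex_card diners_card mastercard;
      do _i = 1 to dim(cards);
         if missing(cards{_i}) then cards{_i} = 'X';
      end;
      if missing(customer_segment) then customer_segment = '4';
      drop _i;
   run;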
Phase 4: Modeling
As shown in figure 1, initial evaluations were done using a combination of Decision Tree, Logistic Regression, various Neural Network, and Ensemble nodes, all feeding into a Model Comparison node. Providing input to these modeling streams was the set of 10 balanced sample data-sets (one at a time, of course). Approximately 70 modeling runs were performed, varying the input data-set, transformations, selection parameters, etc., to zero in on the best fit for Validation ROC. Once the initial modeling “bench” was built, most of our focus turned to various data transformations in an attempt to increase the ROC result of the overall model-set.
Figure 1 - Initial test and evaluation modeling space (work-bench)
Decision Tree (DT, DT-Gini, DT-Entropy): Fed directly from the Data Partition node; no data modification
Selection methods used: Default, Gini, Entropy
Gini consistently outperformed the other two selection methods
Logistic Regression (Reg): Fed by Filter->Transform->Replacement->Impute data stream
Backward selection with 0.25 as the entry significance level and 0.05 as the stay level
Selection method: Validation Misclassification
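A base-SAS sketch of the equivalent regression (the SASEM Regression node was actually used; the input and class lists are abbreviated, and EM's validation-based selection criterion has no direct PROC LOGISTIC analogue, so the stay level alone drives elimination here):

   /* Backward-selection logistic regression on the prepared data. */
   proc logistic data=work.replaced;
      class occn_code marital_status annual_income_range / param=ref;
      model target_flag(event='1') = age_at_application curr_empl_mths
            curr_res_mths log_b_enq_l6m occn_code marital_status
            annual_income_range
            / selection=backward slstay=0.05;
   run;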
Polynomial Logistic Regression (Reg-Poly): Fed by the logistic regression
Multiple runs with polynomial terms only, or polynomial terms plus 2-factor interactions
Neural Networks:
Neural Network and Auto Neural nodes used (3 each)
Feeds from the Filter->Transform->Replacement->Impute data stream, Reg, and Reg-Poly
Auto Neural nodes configuration:
Train action: Search
Max iterations: 50
Tolerance: Low
Activation functions: Direct, Normal, Sine, Tanh
Neural Network nodes configuration:
Selection method: Validation Misclassification
Max iterations: 500
Hidden units: 10
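For reference, a rough base-SAS sketch of the Neural Network node settings (the GUI node was actually used; PROC NEURAL requires a DMDB catalog, the input list is abbreviated, and the statement set here is our best approximation):

   /* Catalog the metadata PROC NEURAL needs. */
   proc dmdb batch data=work.replaced out=work.dmdb dmdbcat=work.dmdbcat;
      var age_at_application curr_empl_mths log_b_enq_l6m;
      class target_flag;
   run;

   /* Multilayer perceptron with 10 hidden units, up to 500 iterations. */
   proc neural data=work.replaced dmdbcat=work.dmdbcat;
      input age_at_application curr_empl_mths log_b_enq_l6m
            / level=interval id=i;
      target target_flag / level=nominal id=t;
      hidden 10 / id=h;
      connect i h;
      connect h t;
      train maxiter=500;
   run;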
Phase 5: Evaluation
As shown below, we went through several modeling phases prior to settling on our final selection.
Our initial modeling efforts employed the previously mentioned modeling work-bench and the balanced sample data-sets. The final results of these tests can be seen in figure 2. As shown, we achieved a respectable ROC (68.85% area under the curve) with a Logistic Regression model using a Selection Model of Backward and Selection Criteria of Validation Misclassification.
At this point we felt we had the right filtering and transforms in place, so to further refine our evaluation we ran the work-bench with the full model data-set as provided. We then eliminated, one by one, the models that did not add to the ROC of the Ensemble node. This led us to the reduced set of modeling nodes shown in figure 3. During these tests it became clear that, with the larger set of data, the Auto Neural node fed from the data-transform stream was outperforming all others.
Testing

Model        Valid:   Valid:             Valid:   Valid:    Valid:  Valid:    Valid:
             Average  Misclassification  Roc      Gain      Lift    Percent   Capture
             Squared  Rate               Index                      Response  Response
             Error
AutoNeural   0.26987  0.50000            0.49773   27.317   1.273   2.190      6.366
AutoNeural2  0.28395  0.50000            0.57575   82.336   1.823   3.136      9.117
AutoNeural3  0.28395  0.50000            0.57575   82.336   1.823   3.136      9.117
Ensmbl       0.45841  0.50000            0.68519  270.370   3.704   6.370     18.519
Reg          0.22565  0.49430            0.68582  274.400   3.744   6.440     18.720
Reg2         0.22843  0.48860            0.68853  298.860   3.989   6.860     19.943
Reg3-Poly    0.23024  0.48433            0.68318  287.464   3.875   6.664     19.373
Tree         0.25000  0.50000            0.50000    0.000   1.000   1.720      5.000
Tree2        0.23852  0.49003            0.60104  220.059   3.201   5.505     16.003
Tree3        0.25000  0.50000            0.50000    0.000   1.000   1.720      5.000
Figure 2 - Best results of modeling with balanced sample data-sets (sds1)
Final Comparison

Model        Valid:   Valid:             Valid:   Valid:    Valid:  Valid:    Valid:
             Average  Misclassification  Roc      Gain      Lift    Percent   Capture
             Squared  Rate               Index                      Response  Response
             Error
AutoNeural   0.01685  0.01720            0.67490  293.162   3.932   6.781     19.658
Ensmbl       0.01704  0.01725            0.67317  287.464   3.875   6.682     19.373
Reg          0.01756  0.01793            0.66848  247.578   3.476   5.995     17.379
Figure 3 - Final set of candidate models
As a model-validation step, we then built a fresh diagram (figure 4) and ran our final model. The results, shown in figure 5, match those found in our previous runs. This model was then used to score the PAKDD prediction data-set; a sketch of the scoring step follows figure 5.
Figure 4 - Final modeling work-space
Production Release

Model        Valid:   Valid:             Valid:   Valid:    Valid:  Valid:    Valid:
             Average  Misclassification  Roc      Gain      Lift    Percent   Capture
             Squared  Rate               Index                      Response  Response
             Error
AutoNeural   0.01685  0.01720            0.67490  293.162   3.932   6.781     19.658
Figure 5 - Final release candidate
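The scoring step can be reproduced by applying the DATA step score code exported from the SASEM Score node to the prediction table; a sketch (the score-code file name and path are hypothetical):

   /* Score the PAKDD prediction data-set with the final AutoNeural model. */
   data pakdd.prediction_scored;
      set pakdd.prediction;
      %include 'C:\pakdd\autoneural_score.sas';  /* exported score code */
   run;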
Phase 6: Deployment
In the real world we would now build a deployable model and commence further testing, refinement, and validation. As it is, we will be turning this work over for evaluation by our instructor.
Overall, this project using real-world data was a positive experience. It allowed us to utilize the steps of the CRISP-DM model (one of us had prior training in this protocol from a previous class), and it definitely stretched our modeling muscles.
Appendix
Initial variable settings
Final filter settings
Final transformation settings