MKTG-5963 Data Mining and CRM Applications

MKTG 5963 Data Mining and CRM Applications Group 3 PAKDD Competition

April 15, 2007

Dana Gray, Jesse Montgomery, Esther Thais, Steven Wheeler

Group 3: Dana Gray, Jesse Montgomery, Esther Thais, Steven Wheeler Page 1 MKTG-5963 – Data Mining and CRM Applications

Table of Contents

CRISP-DM: The Process
  Phase 1: Business Understanding
  Phase 2: Data Understanding
  Phase 3: Data Preparation
  Phase 4: Modeling
  Phase 5: Evaluation
  Phase 6: Deployment
Appendix

CRISP-DM: The Process

Phase 1: Business Understanding

The PAKDD competition is based on a cross-selling problem: how to identify the credit-card holders who are most likely to be viable targets for cross-selling home loans.

Phase 2: Data Understanding

The competition data was supplied in three MS Excel files: Modeling, Prediction, and a Data Dictionary. SAS 9.1 was used to import the Excel files and format them as SAS tables. Both data-sets (Modeling and Prediction) were then imported into SAS Enterprise Miner (SASEM).

The initial Create Data Source task classified all b_* variables as Segment and most of the remaining variables as nominal or interval.

The variables were reviewed and reclassified as needed; full results are available in the appendix. Several variables were rejected because of the number of missing values or a lack of variance (98% to 99% of records sharing a single value).

Class (binary & nominal) variables

Variable              Role    Numcat  NMiss  Mode          ModePct  Mode2         Mode2Pct
AMEX_CARD             INPUT        3   3714  N               89.99                    9.13
ANNUAL_INCOME_RANGE   INPUT        8      0  150K -< 240K    45.66  240K -< 360K     27.38
A_DISTRICT_APPLICANT  INPUT        9      0  2               36.46  4                23.44
CHQ_ACCT_IND          INPUT        3      1  N               73.94  Y                26.06
CREDIT_CARD_TYPE      INPUT        2      0  C               85.02  B                14.98
CUSTOMER_SEGMENT      INPUT       13   1740  4               15.82  6                 15.6
DINERS_CARD           INPUT        3   3885  N               90.31                    9.55
DVR_LIC               INPUT        2      0  1               98.08  0                 1.92
MARITAL_STATUS        INPUT        7      0  C                49.5  E                23.31
MASTERCARD            INPUT        4   2951  N               88.82                    7.25
NBR_OF_DEPENDANTS     INPUT       24      0  0               85.02  15                6.74
OCCN_CODE             INPUT        6      0  M               39.73  R                26.53
RENT_BUY_CODE         INPUT        3   3710  N                89.9                    9.12
RETAIL_CARDS          INPUT        4      2  Y               89.37  N                10.62
SAV_ACCT_IND          INPUT        3   2484  N               88.12                     6.1
VISA_CARD             INPUT        2      0  0               98.28  1                 1.72
TARGET_FLAG           TARGET       2      0  0               98.28  1                 1.72

Note that DVR_LIC (driver’s license) and TARGET_FLAG are set to binary; all others are nominal.

Interval variables

Variable                Role       Mean   StdDev  Non-Missing  Missing    Min  Median     Max
AGE_AT_APPLICATION      INPUT    38.548    11.34        40700        0     18      38      84
A_TOTAL_AMT_DELQ        INPUT       1.8     33.8        40700        0      0       0    2322
A_TOTAL_BALANCES        INPUT   754.539  4320.22        40700        0  -5994       0  123456
A_TOTAL_NBR_ACCTS       INPUT     0.167     0.42        40700        0      0       0       4
B_ENQ_L12M_GR1          INPUT     1.478      9.7        40700        0      0       0      99
B_ENQ_L12M_GR2          INPUT     1.627      9.7        40700        0      0       0      99
B_ENQ_L12M_GR3          INPUT     1.516     9.71        40700        0      0       0      99
B_ENQ_L1M               INPUT     1.179      9.7        40700        0      0       0      99
B_ENQ_L3M               INPUT     1.529      9.7        40700        0      0       0      99
B_ENQ_L6M               INPUT     2.038     9.71        40700        0      0       1      99
B_ENQ_L6M_GR1           INPUT     1.244      9.7        40700        0      0       0      99
B_ENQ_L6M_GR2           INPUT     1.311      9.7        40700        0      0       0      99
B_ENQ_L6M_GR3           INPUT     1.247      9.7        40700        0      0       0      99
B_ENQ_LAST_WEEK         INPUT     1.039      9.7        40700        0      0       0      99
CURR_EMPL_MTHS          INPUT    75.128    84.15        40700        0      0      48    1000
CURR_RES_MTHS           INPUT    87.336    92.79        40700        0      0      51     723
NBR_OF_DEPENDANTS       INPUT       0.9     1.16        40700        0      0       0      15
PREV_EMPL_MTHS          INPUT     6.929    30.03        40700        0      0       0     480
PREV_RES_MTHS           INPUT    23.622    52.37        40700        0      0       1     624
TOTAL_NBR_CREDIT_CARDS  INPUT     0.133     0.46        40700        0      0       0      20

A series of Sample nodes was used to create a set of 10 balanced sample data-sets (700/700). SAS code was then used to export the samples to the project library, and the samples were imported into SASEM via Create Data Source. Prior probabilities were then set for each sample data-set. These data-sets were used for all modeling efforts up until final model selection.
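The balanced sampling and prior-probability step can be illustrated with a minimal Python sketch. It is not the SAS code used in the project. It assumes that "700/700" means 700 events and 700 non-events per sample, that the modeling data holds 700 events (inferred from the 1.72% TARGET_FLAG rate over 40,700 rows), and that the priors are applied with the standard rescaling of sample posteriors. The function names are hypothetical.

```python
import random

def balanced_sample(rows, target_key, n_per_class, seed):
    """Draw n_per_class events and n_per_class non-events without replacement."""
    rng = random.Random(seed)
    events = [r for r in rows if r[target_key] == 1]
    non_events = [r for r in rows if r[target_key] == 0]
    return rng.sample(events, n_per_class) + rng.sample(non_events, n_per_class)

def adjust_for_priors(p_sample, prior_event, sample_event=0.5):
    """Rescale an event probability from the balanced sample to the population prior."""
    event_part = p_sample * prior_event / sample_event
    non_event_part = (1.0 - p_sample) * (1.0 - prior_event) / (1.0 - sample_event)
    return event_part / (event_part + non_event_part)

# Toy stand-in for the modeling data: 40,700 rows, 700 of them events (assumed).
population = [{"id": i, "TARGET_FLAG": 1 if i < 700 else 0} for i in range(40700)]

# Ten balanced samples, each drawn with a different seed.
samples = [balanced_sample(population, "TARGET_FLAG", 700, seed) for seed in range(10)]
prior = 700 / 40700

print(len(samples), len(samples[0]))
print(round(adjust_for_priors(0.5, prior), 4))
```

A score of 0.5 on a balanced sample carries no information beyond the base rate, so after adjustment it maps back to the population prior of roughly 1.72%.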

Phase 3: Data Preparation

Taking our balanced data-sets as a starting point, we moved into clean-up and preparation. Filter, Transform, Replace, and Impute nodes were all employed. Initial settings were established based on our study of the data, and multiple tests were performed to ascertain the effects of any changes. Full configuration parameters for the Filter and Transform nodes can be found in the appendix.

Summary of data preparation:

• Filter: All non-rejected input b_* variables were clipped to remove the values 98 and 99 from the top end. These are special values assigned by the company and do not denote customer information.

• Transform: Because of strong skew and kurtosis in many of the variables, we tried a number of transforms (Max Normal, Max Correlation, Optimal, and Standardize), with model runs between each change so that we could evaluate the results. Max Normal and Optimal provided the greatest positive impact, with Max Normal winning out when applied to the original modeling data-set. Following are some of the more striking results. However, it should be noted that every transform was verified as adding to the efficacy of the production model.

• Replacement: For credit cards, unknown values are represented by ‘X’.

• Impute: Left at default settings; CUSTOMER_SEGMENT was the only variable processed.
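The Filter and Replacement steps above can be sketched in Python as follows. This is an illustration, not the Enterprise Miner node configuration. It assumes the filter excludes any record carrying a special code (treating the code as missing would be another reading) and that an unknown card value appears as a blank or missing entry. The helper names and toy rows are hypothetical.

```python
SPECIAL_CODES = {98, 99}  # assumption: the only special codes in the b_* variables

def drop_special_codes(rows, bureau_vars):
    """Keep rows whose bureau variables contain no special code."""
    return [r for r in rows if all(r[v] not in SPECIAL_CODES for v in bureau_vars)]

def replace_unknown(rows, card_vars, unknown="X"):
    """Return copies of rows with blank or missing card flags set to `unknown`."""
    cleaned = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left untouched
        for v in card_vars:
            if r.get(v) in (None, ""):
                r[v] = unknown
        cleaned.append(r)
    return cleaned

# Toy rows using variable names from the tables above; the values are made up.
rows = [
    {"B_ENQ_L6M": 1, "B_ENQ_L3M": 0, "AMEX_CARD": "N"},
    {"B_ENQ_L6M": 99, "B_ENQ_L3M": 99, "AMEX_CARD": ""},
    {"B_ENQ_L6M": 2, "B_ENQ_L3M": 1, "AMEX_CARD": None},
]
kept = drop_special_codes(rows, ["B_ENQ_L6M", "B_ENQ_L3M"])
prepared = replace_unknown(kept, ["AMEX_CARD"])
print(prepared)
```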

Phase 4: Modeling

As shown in figure 1, initial evaluations were done using a combination of Decision Tree, Logistic Regression, various Neural Network, and Ensemble nodes, all feeding into a Model Comparison node. Input to these modeling streams came from the set of 10 balanced sample data-sets (one at a time). Approximately 70 modeling runs were performed, varying the input data-set, transformations, selection parameters, etc., to zero in on a best fit for Validation ROC. Once the initial modeling “bench” was built, most of our focus turned to various data transformations in an attempt to increase the ROC result of the overall model-set.

Figure 1 - Initial test and evaluation modeling space (work-bench)

Decision Tree (DT, DT-Gini, DT-Entropy):

• Fed directly from the Data-Partition node, with no data modification

• Selection methods used: Default, Gini, Entropy

• Gini consistently outperformed the other two selection methods
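For reference, the Gini and Entropy criteria score a candidate split by how much it reduces node impurity. The sketch below shows the textbook definitions in Python; it does not reproduce Enterprise Miner's internals or its Default criterion, and the class counts are made up for illustration.

```python
import math

def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy (in bits) of a node given its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def impurity_reduction(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

# Hypothetical split of a balanced 700/700 node into two children.
parent = [700, 700]
children = [[500, 200], [200, 500]]
print(round(impurity_reduction(parent, children, gini), 4))
print(round(impurity_reduction(parent, children, entropy), 4))
```

Both criteria are zero for a pure node and largest for an evenly mixed one, but they weight intermediate mixes differently, which is why they can choose different splits.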

Logistic Regression (Reg):

• Fed by the Filter->Transform->Replacement->Impute data stream

• Backward selection with 0.25 as the entry level and 0.05 as the retain level

• Selection criterion: Validation Misclassification

Polynomial Logistic Regression (Reg-Poly):

• Fed by the logistic regression

• Multiple runs with polynomial terms only or polynomial terms plus 2-factor interactions
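To make the two Reg-Poly configurations concrete, the following Python sketch expands a record into squared terms and, optionally, 2-factor interaction terms. It assumes a polynomial degree of 2, which the report does not state, and the function name and term labels are hypothetical.

```python
from itertools import combinations

def expand_terms(row, degree=2, interactions=True):
    """Add polynomial terms up to `degree` and, optionally, 2-factor interactions."""
    out = dict(row)
    for name, x in row.items():
        for d in range(2, degree + 1):
            out[f"{name}^{d}"] = x ** d
    if interactions:
        # One product term for every unordered pair of original inputs.
        for (a, xa), (b, xb) in combinations(row.items(), 2):
            out[f"{a}*{b}"] = xa * xb
    return out

# Median values from the interval-variable table, used here only as sample inputs.
record = {"AGE_AT_APPLICATION": 38, "CURR_EMPL_MTHS": 48}
print(expand_terms(record, interactions=False))
print(expand_terms(record))
```

With k inputs, the interaction option adds k(k-1)/2 terms, so the candidate set for backward selection grows quickly.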

Neural Networks:

• Neural Network and Auto Neural nodes used (3 each)

• Fed from the Filter->Transform->Replacement->Impute data stream, Reg, and Reg-Poly

• Auto Neural node configuration:
    Train action: Search
    Max iterations: 50
    Tolerance: Low
    Activation functions: Direct, Normal, Sine, Tanh

• Neural Network node configuration:
    Selection method: Validation Misclassification
    Max iterations: 500
    Hidden units: 10

Phase 5: Evaluation

As shown below, we went through several modeling phases before settling on our final selection.

Our initial modeling efforts employed the previously mentioned modeling work-bench and the balanced sample data-sets. The final results of these tests can be seen in figure 2. As shown, we achieved a respectable ROC index (68.85% area under the curve) with a Logistic Regression model (Reg2) using a selection model of Backward and a selection criterion of Validation Misclassification.
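As a reminder of what the ROC index measures, the area under the ROC curve equals the probability that a randomly chosen event receives a higher score than a randomly chosen non-event, with ties counted as one half. The Python sketch below computes it by direct pair counting. The scores and labels are made up, and this is not how Enterprise Miner computes the statistic internally.

```python
def roc_index(scores, labels):
    """Area under the ROC curve via pairwise comparison of events and non-events."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # event ranked above non-event
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical model scores for four cases (1 = event, 0 = non-event).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(roc_index(scores, labels))

# A model that gives every case the same score cannot rank at all.
print(roc_index([0.5, 0.5, 0.5, 0.5], labels))
```

A value of 0.5 therefore means no ranking ability, and 1.0 means every event is scored above every non-event.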

At this point we felt we had the right filtering and transforms in place, so to further refine our evaluation we ran the work-bench with the full modeling data-set as provided. We then eliminated, one by one, the models that did not add to the ROC of the Ensemble node. This led us to the reduced set of modeling nodes shown in figure 3. During these tests it became clear that, with the larger set of data, the Auto Neural node fed from the data transform stream was outperforming all others.

Testing (all columns are Valid: statistics)

              Avg Squared  Misclassification      Roc                   Percent   Capture
Model               Error               Rate    Index     Gain   Lift  Response  Response
AutoNeural        0.26987            0.50000  0.49773   27.317  1.273     2.190     6.366
AutoNeural2       0.28395            0.50000  0.57575   82.336  1.823     3.136     9.117
AutoNeural3       0.28395            0.50000  0.57575   82.336  1.823     3.136     9.117
Ensmbl            0.45841            0.50000  0.68519  270.370  3.704     6.370    18.519
Reg               0.22565            0.49430  0.68582  274.400  3.744     6.440    18.720
Reg2              0.22843            0.48860  0.68853  298.860  3.989     6.860    19.943
Reg3-Poly         0.23024            0.48433  0.68318  287.464  3.875     6.664    19.373
Tree              0.25000            0.50000  0.50000    0.000  1.000     1.720     5.000
Tree2             0.23852            0.49003  0.60104  220.059  3.201     5.505    16.003
Tree3             0.25000            0.50000  0.50000    0.000  1.000     1.720     5.000

Figure 2 - Best results of modeling with balanced sample data-sets (sds1)
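The Gain, Lift, Percent Response, and Capture Response columns in figure 2 are closely related. The Python sketch below reproduces the other three from Lift under two assumptions inferred from the Tree row (lift 1.000, percent response 1.720, capture response 5.000): a base event rate of 1.72% and statistics reported at the top 5% of scored cases. The function name is hypothetical.

```python
def lift_statistics(lift, base_rate_pct=1.72, depth_pct=5.0):
    """Derive gain, percent response, and capture response from lift.

    base_rate_pct and depth_pct are assumptions inferred from the Tree row.
    """
    return {
        "gain": (lift - 1.0) * 100.0,              # percent improvement over baseline
        "percent_response": lift * base_rate_pct,  # event rate within the selected group
        "capture_response": lift * depth_pct,      # share of all events in the group
    }

# Lift for Reg2 as reported in figure 2.
stats = lift_statistics(3.989)
for name, value in stats.items():
    print(name, round(value, 3))
```

For Reg2 this gives values within rounding of the reported 298.860, 6.860, and 19.943, which supports the assumed base rate and depth.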

Final Comparison (all columns are Valid: statistics)

              Avg Squared  Misclassification      Roc                   Percent   Capture
Model               Error               Rate    Index     Gain   Lift  Response  Response
AutoNeural        0.01685            0.01720  0.67490  293.162  3.932     6.781    19.658
Ensmbl            0.01704            0.01725  0.67317  287.464  3.875     6.682    19.373
Reg               0.01756            0.01793  0.66848  247.578  3.476     5.995    17.379

Figure 3 - Final set of candidate models

As a modeling validation step, we then built a fresh diagram (figure 4) and ran our final model. The results are shown in figure 5 and match our previous results. This model was then used to score the PAKDD prediction data-set.

Figure 4 - Final modeling work-space

Production Release (all columns are Valid: statistics)

              Avg Squared  Misclassification      Roc                   Percent   Capture
Model               Error               Rate    Index     Gain   Lift  Response  Response
AutoNeural        0.01685            0.01720  0.67490  293.162  3.932     6.781    19.658

Figure 5 - Final release candidate

Phase 6: Deployment

In the real world we would now build a deployable model and commence further testing, refinement, and validation. As it is, we will be turning this work over for evaluation by our instructor.

Overall this project using real world data was a positive experience. It allowed us to utilize the steps of the CRISP-DM model (one of us had prior training in this protocol from a previous class), and definitely stretched our modeling muscles.

Appendix

Initial variable settings

Final filter settings

Final transformation settings
