
DSC 433/533 – Homework 5

Reading

“Data Mining Techniques” by Berry and Linoff (2nd edition): chapter 5 (pages 123-163).

Exercises

Hand in answers to the following questions at the beginning of the first class of week 6. The questions are based on the Charles Book Club Case (see separate document on the handouts page of the course website) and the Excel dataset Charles.xls (available on the data page of the course website).

1. In homework 1 you made a Standard Partition of the data into Training, Validation, and Test samples with 1800, 1400, and 800 observations respectively. Find this file and open it in Excel (alternatively, open Charles.xls and re-do the partition: select all variables to be in the partition, use a random seed of 79963, and specify percentages of 45%, 35%, and 20% respectively for the three samples). Do a logistic regression analysis using the R/F/M variables to estimate the probability that a customer purchases “The Art History of Florence.” Select XLMiner > Classification > Logistic Regression. Move M, R, and F to the “Input variables” box and Florence to the “Output variable” box. Hit “Next” twice. Select “Summary report” for “Score training data.” Select “Detailed report,” “Summary report,” and “Lift charts” for “Score validation data.” De-select “Summary report” for “Score test data” (we won’t be using test data for this assignment).

We can use the results of the logistic regression analysis to estimate the probability that a customer will purchase as 1/(1+exp(–b0 – b1M – b2R – b3F)), where b0, b1, b2, and b3 are the estimated regression coefficients. You should be able to find in the results that b0 = –2.52, b1 = 0.00064, b2 = –0.032, and b3 = 0.089. So, for example, the first validation sample customer, row ID 2 (with M=128, R=8, and F=2), has purchase probability: 1/(1+exp(2.52 – 0.00064(128) + 0.032(8) – 0.089(2))) = 1/(1+exp(2.51608)) = 0.075. To turn in: According to the results of the analysis, which customers are more likely to purchase, those with low or high values of M? Those with low or high values of R? Those with low or high values of F? Look back at the definitions of M, R, and F in the case to see if this makes sense.
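As a cross-check on the worked example in part 1, the same probability can be computed outside Excel. This is only a sketch in Python (not part of the XLMiner workflow); the function name purchase_probability is made up for illustration, and the default coefficients are the rounded values quoted above.

```python
import math

def purchase_probability(m, r, f, b0=-2.52, b1=0.00064, b2=-0.032, b3=0.089):
    """Estimated purchase probability 1/(1+exp(-b0 - b1*M - b2*R - b3*F)).

    The default coefficients are the rounded estimates quoted in part 1;
    the function name is a hypothetical helper, not an XLMiner feature.
    """
    linear = b0 + b1 * m + b2 * r + b3 * f
    return 1.0 / (1.0 + math.exp(-linear))

# First validation customer, row ID 2: M=128, R=8, F=2.
p = purchase_probability(m=128, r=8, f=2)
print(round(p, 3))  # prints 0.075
```

Changing one of M, R, or F while holding the others fixed shows the direction of each effect, which is what the "To turn in" question asks about.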

2. Construct a Lift Chart for purchases for the Validation sample. Recall that this has cumulative offers sent (on the horizontal axis) versus cumulative expected purchases (on the vertical axis). From part 1 above, the estimated probability of purchase for each customer is 1/(1+exp(–b0 – b1M – b2R – b3F)). You can find these estimated probabilities on the “LR_ValidScore1” worksheet in the column headed “Prob. for 1 (success)” next to the column headed “Actual Class,” which shows whether the customer made a purchase or not. Select and copy these two columns (from E7 to F1407), and paste them into the K and L columns, so that “Actual Class” is in column K and “Prob. for 1 (success)” is in column L. Select Data > Sort to sort the K and L columns in descending order, sorting on “Prob. for 1 (success)” in column L. Put consecutive integers from 1 to 1400 into column M (e.g., type “1” into cell M8, then select and drag the bottom right corner of the cell while holding down the “Control” key to put consecutive integers into cells M8 to M1407). Put cumulative expected purchases from the logistic regression analysis into column N (e.g., type “=SUM(K$8:K8)” into cell N8, then copy this cell into cells N8 to N1407). Put cumulative expected purchases from a simple random sample of customers into column O (e.g., type “=N$1407*M8/M$1407” into cell O8, then copy this cell into cells O8 to O1407). Type suitable titles for columns M, N, and O, e.g., rows 7 to 10 should be:

Actual   Prob. for 1   Cumulative   Logistic     Random
Class    (success)     offers       regression   sample
0        0.217896736   1            0            0.087142857
0        0.210465912   2            0            0.174285714
1        0.20966871    3            1            0.261428571

Then highlight cells M7 to O1407, start the “Chart Wizard” by clicking on the chart icon, select “XY (Scatter)” for the Chart type, and select “Scatter with data points connected by lines” for the Chart sub-type. To turn in: Label the resulting chart by selecting “Chart > Chart Options,” and labeling the X-axis “offers sent,” labeling the Y-axis “purchases,” and titling the chart “Lift Chart for Purchases.” (Hint: you can check your lift chart looks OK because XLMiner should have drawn one automatically on the “LR_ValidLiftChart1” worksheet.)
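The spreadsheet steps in part 2 (sort by estimated probability, count offers, accumulate purchases, and compare with a random-sample baseline) can also be sketched in Python. The function name lift_table and the five-customer toy data are invented for illustration; they are not taken from Charles.xls.

```python
def lift_table(actual, prob):
    """Return rows of (offers, cumulative_model, cumulative_random).

    Mirrors columns M, N, and O of the spreadsheet: customers are sorted
    by estimated probability (highest first), purchases are accumulated,
    and the random baseline is total_purchases * offers / n.
    """
    ranked = sorted(zip(actual, prob), key=lambda pair: pair[1], reverse=True)
    total = sum(actual)
    n = len(ranked)
    rows = []
    running = 0
    for offers, (bought, _) in enumerate(ranked, start=1):
        running += bought
        rows.append((offers, running, total * offers / n))
    return rows

# Toy data (hypothetical): 1 = purchased, 0 = did not purchase.
actual = [1, 0, 0, 1, 0]
prob = [0.20, 0.22, 0.21, 0.05, 0.10]
rows = lift_table(actual, prob)
for row in rows:
    print(row)
```

Both curves end at the same point (all offers sent, all purchases found); the lift is the vertical gap between them before that point.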

3. Construct a Lift Chart for profit for the Validation sample. Assume that it costs $0.50 to send an offer to a customer, and that Charles Book Club receives $6.00 for each copy of “The Art History of Florence” purchased. Use results from the table in part 2 to construct the chart. Select and copy cumulative offers in column M (from M7 to M1407), and paste it into the P column. Put cumulative expected profit from the logistic regression analysis into column Q (e.g., type “=6*N8-0.5*P8” into cell Q8, then copy this cell into cells Q8 to Q1407). Put cumulative expected profit from a simple random sample of customers into column R (e.g., type “=6*O8-0.5*P8” into cell R8, then copy this cell into cells R8 to R1407). Rows 7 to 10 should be:

Cumulative   Logistic     Random
offers       regression   sample
1            -0.5         0.022857143
2            -1           0.045714286
3            4.5          0.068571429

Then highlight cells P7 to R1407, start the “Chart Wizard” by clicking on the chart icon, select “XY (Scatter)” for the Chart type, and select “Scatter with data points connected by lines” for the Chart sub-type. To turn in: Label the resulting chart by selecting “Chart > Chart Options,” and labeling the X-axis “offers sent,” labeling the Y-axis “profit,” and titling the chart “Lift Chart for Profit.” Maximum profit occurs if we send offers to all customers with probability of purchase > 0.0816. Make a note of the number of such customers and the corresponding profit.
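The profit columns in part 3 apply one formula, profit = 6 × cumulative purchases – 0.5 × cumulative offers, and the best mailing depth is wherever that curve peaks. A minimal Python sketch follows; the function names profit_curve and best_cutoff are hypothetical, and the three input values are the first three cumulative purchase counts from the table in part 2.

```python
def profit_curve(cum_purchases, revenue=6.00, cost=0.50):
    """Cumulative profit after each offer: revenue*purchases - cost*offers."""
    return [revenue * bought - cost * offers
            for offers, bought in enumerate(cum_purchases, start=1)]

def best_cutoff(profits):
    """Return (number_of_offers, profit) at the first maximum of the curve."""
    best = max(profits)
    return profits.index(best) + 1, best

# First three cumulative purchase counts from the table in part 2.
cum_purchases = [0, 0, 1]
profits = profit_curve(cum_purchases)
print(profits)               # prints [-0.5, -1.0, 4.5]
print(best_cutoff(profits))  # prints (3, 4.5)
```

Applied to all 1400 validation rows, the cutoff position identifies the last customer worth mailing, and that customer's estimated probability gives the threshold quoted in the assignment.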

4. Do a logistic regression analysis using the following 15 predictor variables: Gender, M, R, F, FirstPurch, ChildBks, YouthBks, CookBks, DoItYBks, RefBks, ArtBks, GeogBks, ItalCook, ItalAtlas, ItalArt (i.e., use all but Seq#, ID#, Mcode, Rcode, Fcode, and Related Purchase). Follow the same instructions as in parts 1, 2, and 3 to construct Lift Charts for both purchases and profit. Rows 7 to 10 of columns W to AD in the “LR_ValidScore2” worksheet should be:

Actual   Prob. for 1   Cumulative   Logistic     Random        Cumulative   Logistic     Random
Class    (success)     offers       regression   sample        offers       regression   sample
1        0.619936298   1            1            0.087142857   1            5.5          0.022857143
1        0.504007617   2            2            0.174285714   2            11           0.045714286
0        0.378642044   3            2            0.261428571   3            10.5         0.068571429

Each of these columns is derived as follows:
W: copied from column E
X: copied from column F (remember to sort columns W and X in descending order using column X)
Y: consecutive integers
Z: =SUM(W$8:W8)
AA: =Z$1407*Y8/Y$1407
AB: copied from column Y
AC: =6*Z8-0.5*AB8
AD: =6*AA8-0.5*AB8

To turn in: Maximum profit for this logistic regression analysis occurs if we send offers to all customers with probability of purchase > 0.07788. Make a note of the number of such customers and the corresponding profit.

5. Do a logistic regression analysis using a subset of 7 of the 15 predictor variables from part 4. Experiment with the “Best subset” option at step 2 (try “exhaustive search”) of the XLMiner logistic regression analysis to find the 7 most useful predictors. Then re-do the analysis using only these 7 predictors and follow the same instructions as in parts 1, 2, and 3 to construct Lift Charts for both purchases and profit. Rows 7 to 10 of columns O to V should be:

Actual   Prob. for 1   Cumulative   Logistic     Random        Cumulative   Logistic     Random
Class    (success)     offers       regression   sample        offers       regression   sample
1        0.604373691   1            1            0.087142857   1            5.5          0.022857143
1        0.440595871   2            2            0.174285714   2            11           0.045714286
0        0.404165282   3            2            0.261428571   3            10.5         0.068571429

Each of these columns is derived as follows:
O: copied from column E
P: copied from column F (remember to sort columns O and P in descending order using column P)
Q: consecutive integers
R: =SUM(O$8:O8)
S: =R$1407*Q8/Q$1407
T: copied from column Q
U: =6*R8-0.5*T8
V: =6*S8-0.5*T8

To turn in: List the 7 predictor variables you identified as being most useful. Maximum profit for this logistic regression analysis occurs if we send offers to all customers with probability of purchase > 0.0759 (be careful – not 0.0751 – why?). Make a note of the number of such customers and the corresponding profit.

6. The total Charles Book Club database contains some 500,000 customers. To turn in: According to the results of the logistic regression analyses, which customers should we target to maximize expected profit, how many offers should we expect to make, and how much total profit should we expect?
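One possible way to approach part 6 is to scale the validation-sample results up in proportion to the database size. The Python sketch below assumes that the 1400 validation customers are representative of all 500,000; the function name scale_to_database and the input values 700 and 140.0 are placeholders for illustration only, not results from any of the analyses.

```python
def scale_to_database(offers_valid, profit_valid, n_valid=1400, n_database=500_000):
    """Scale validation-sample offers and profit to the full database.

    Assumes the validation sample is representative of the whole customer
    base, so both quantities grow by the factor n_database / n_valid.
    """
    factor = n_database / n_valid
    return offers_valid * factor, profit_valid * factor

# Placeholder inputs (hypothetical): replace with the number of offers and
# the maximum profit noted for your chosen model in parts 3, 4, or 5.
offers, profit = scale_to_database(offers_valid=700, profit_valid=140.0)
print(round(offers), round(profit, 2))
```

The scaling step itself is simple; the substantive choice is which model's probability cutoff to apply, and that should rest on the validation profit figures.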

7. A useful resource for reviewing statistics material can be found at www.statistics.com. To turn in: What is the definition of “data mining” at this website (under Resources > Stat Term Glossary)?