Machine Learning Based Prediction and Classification for Uplift Modeling
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

LOVISA BÖRTHAS
JESSICA KRANGE SJÖLANDER

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES

Degree Projects in Mathematical Statistics (30 ECTS credits)
Degree Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, 2020
Supervisor at KTH: Tatjana Pavlenko
Examiner at KTH: Tatjana Pavlenko

TRITA-SCI-GRU 2020:002
MAT-E 2020:02

Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci

Abstract

The desire to model the true gain from targeting an individual for marketing purposes has led to the common use of uplift modeling. Uplift modeling requires a treatment group as well as a control group, and the objective is to estimate the difference between the success probabilities in the two groups. Statistical machine learning methods are efficient tools for estimating the probabilities in uplift models. In this project, the uplift modeling approaches Subtraction of Two Models, Modeling Uplift Directly and the Class Variable Transformation are investigated. The statistical machine learning methods applied are Random Forests and Neural Networks, along with the standard method Logistic Regression. The data were collected from a well-established retail company, and the purpose of the project is to investigate which uplift modeling approach and statistical machine learning method yield the best performance on these data. The variable selection step proved to be a crucial component of the modeling process, as did the amount of control data in each data set.
For uplift modeling to be successful, the method of choice should be either Modeling Uplift Directly using Random Forests or the Class Variable Transformation using Logistic Regression. Neural network-based approaches are sensitive to uneven class distributions and were hence unable to obtain stable models given the data used in this project. Furthermore, the Subtraction of Two Models did not perform well, because each model tended to focus too much on modeling the class in its own data set instead of modeling the difference between the class probabilities. The conclusion is hence to use an approach that models the uplift directly, and to use a large amount of control data in the data sets.

Keywords

Uplift Modeling, Data Pre-Processing, Predictive Modeling, Random Forests, Ensemble Methods, Logistic Regression, Machine Learning, Multi-Layer Perceptron, Neural Networks.

Abstract (translated from the Swedish)

The need to model the true gain from targeted marketing has led to the now common method of incremental response analysis (uplift modeling). Carrying out this type of method requires an existing test group as well as a control group, and the goal is thus to compute the difference between the positive outcomes in the two groups. The probabilities of positive outcomes in the two groups can be estimated efficiently with statistical machine learning methods. The incremental response analysis methods investigated in this project are subtraction of two models, modeling the incremental response directly, and a class variable transformation. The statistical machine learning methods applied are random forests and neural networks, along with the standard method logistic regression. The data were collected from a well-established retail company, and the goal is thus to investigate which incremental response analysis method and machine learning method perform best given the data in this project.

The most decisive aspects for obtaining a good result proved to be the variable selection and the amount of control data in each data set. For a successful result, the machine learning method of choice should be random forests, used to model the incremental response directly, or logistic regression together with a class variable transformation. Neural network methods are sensitive to uneven class distributions and therefore cannot obtain stable models with the given data. Furthermore, subtraction of two models performed poorly because each model tended to focus too much on modeling the class in the two data sets separately, instead of modeling the difference between them. The conclusion is thus that a method that models the incremental response directly, together with a relatively large control group, is preferable for obtaining a stable result.

Acknowledgements

We would like to thank Mattias Andersson at Friends & Insights, the key person who made this project happen in the first place. Many thanks for introducing us to the uplift modeling technique and for suggesting our thesis project for the CRM department at the retail company. We would also like to thank Elin Thiberg at the retail company, who supervised us when needed and gladly answered every question we had regarding the structure of the different data sets. Another person at the retail company who supported us and guided us in the right direction was Sara Grünewald, and for that we are truly grateful. Last but not least, we would like to send a great thank you to our examiner and supervisor, Professor Tatjana Pavlenko, for providing professional advice and for guiding us during our meetings.

Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Purpose and Goal
    1.3.1 Ethics
  1.4 Data
  1.5 Methodology
  1.6 Delimitations and Challenges
  1.7 Outline
2 Theoretical Background and Related Work
3 Data
  3.1 Markets and Campaigns
  3.2 Variables
4 Methods and Theory
  4.1 Data Pre-Processing
    4.1.1 Data Cleaning
    4.1.2 Variable Selection and Dimension Reduction
    4.1.3 Binning of Variables
  4.2 Uplift Modeling
    4.2.1 Subtraction of Two Models
    4.2.2 Modeling Uplift Directly
    4.2.3 Class Variable Transformation
  4.3 Classification and Prediction
    4.3.1 Logistic Regression
    4.3.2 Random Forests
    4.3.3 Neural Networks
    4.3.4 Cross Validation
  4.4 Evaluation
    4.4.1 ROC Curve
    4.4.2 Qini Curve
  4.5 Programming Environment of Choice
5 Experiments and Results
  5.1 Data Pre-Processing
    5.1.1 Data Cleaning
  5.2 Uplift Modeling and Classification
    5.2.1 Random Forests
    5.2.2 Logistic Regression
    5.2.3 Neural Networks
    5.2.4 Cutoff for Classification of Customers
6 Conclusions
  6.1 Discussion
  6.2 Future Work
  6.3 Final Words
References

1 Introduction

This thesis begins with a general introduction to the area of the degree project, presented in the following subsections.

1.1 Background

In retail and marketing, predictive modeling is a common tool for targeting individuals and evaluating their response when an action is taken on them. The action is normally referred to as a campaign or offer that is sent out to the customers, and the response to model is the likelihood that a specific customer will act on the offer. Put differently, in traditional response models the objective is to predict the conditional class probability $P(Y = 1 \mid X = x)$, where the response $Y \in \{0, 1\}$ reflects whether a customer responded positively to an action (i.e. made a purchase) or not (i.e. did not make a purchase). $X = (X_1, \ldots, X_p)$ are the quantitative and qualitative attributes of the customer, and $x$ is one observation.

Using traditional response modeling, the resulting classifier can then be used to select which customers to target when sending out campaigns or offers for marketing purposes. In reality, this is not always the desirable approach, since the targeted customers are those who are most likely to react positively to the offer after the offer has been sent out. The solution is thus to use a second-order approach known as uplift modeling.

The original idea behind uplift modeling is to use two separate training and test sets: one training and test set containing a treatment group, and one training and test set containing a control group. The customers in the treatment group are subject to an action whereas the customers in the control group are not. Uplift modeling thus aims at modeling