DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020

Machine Learning Based Prediction and Classification for Uplift Modeling

LOVISA BÖRTHAS

JESSICA KRANGE SJÖLANDER

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES


Degree Projects in Mathematical Statistics (30 ECTS credits)
Degree Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, year 2020
Supervisor at KTH: Tatjana Pavlenko
Examiner at KTH: Tatjana Pavlenko

TRITA-SCI-GRU 2020:002 MAT-E 2020:02

Royal Institute of Technology School of Engineering Sciences KTH SCI SE-100 44 Stockholm, Sweden URL: www.kth.se/sci

Abstract

The desire to model the true gain from targeting an individual for marketing purposes has led to the widespread use of uplift modeling. Uplift modeling requires the existence of a treatment group as well as a control group, and the objective hence becomes estimating the difference between the success probabilities in the two groups. Statistical machine learning methods provide efficient means of estimating these probabilities. In this project the uplift modeling approaches Subtraction of Two Models, Modeling Uplift Directly and the Class Variable Transformation are investigated. The statistical machine learning methods applied are Random Forests and Neural Networks, along with the standard method Logistic Regression. The data is collected from a well-established retail company, and the purpose of the project is thus to investigate which uplift modeling approach and statistical machine learning method yield the best performance given the data used in this project. The variable selection step was shown to be a crucial component in the modeling process, as was the amount of control data in each data set. For the uplift modeling to be successful, the method of choice should be either Modeling Uplift Directly using Random Forests, or the Class Variable Transformation using Logistic Regression. Neural network based approaches are sensitive to uneven class distributions and were hence not able to obtain stable models given the data used in this project. Furthermore, the Subtraction of Two Models did not perform well, because each model tended to focus too much on modeling the class in the two data sets separately instead of modeling the difference between the class probabilities. The conclusion is hence to use an approach that models the uplift directly, and to use a large amount of control data in the data sets.

Keywords

Uplift Modeling, Data Pre-Processing, Predictive Modeling, Random Forests, Ensemble Methods, Logistic Regression, Machine Learning, Multi-Layer Perceptron, Neural Networks.


Abstract

The need to model the true gain of targeted marketing has led to the nowadays common method of incremental response analysis (uplift modeling). Performing this type of analysis requires the existence of a treatment group as well as a control group, and the goal is thus to compute the difference between the positive outcomes in the two groups. The probabilities of positive outcomes in the two groups can be estimated efficiently with statistical machine learning methods. The uplift modeling approaches investigated in this project are subtraction of two models, modeling the uplift directly, and a class variable transformation. The statistical machine learning methods applied are random forests and neural networks, along with the standard method logistic regression. The data is collected from a well-established retail company, and the goal is therefore to investigate which uplift modeling approach and machine learning method perform best given the data in this project. The most decisive aspects for obtaining a good result turned out to be the variable selection and the amount of control data in each data set. For a successful result, the machine learning method of choice should be random forests used to model the uplift directly, or logistic regression together with a class variable transformation. Neural network methods are sensitive to uneven class distributions and were hence not able to obtain stable models with the given data. Furthermore, subtraction of two models performed poorly because each model tended to focus too much on modeling the class in the two data sets separately, instead of modeling the difference between them. The conclusion is thus that a method which models the uplift directly, together with a relatively large control group, is preferable for obtaining a stable result.


Acknowledgements

We would like to thank Mattias Andersson at Friends & Insights, the key person who made this project happen to begin with. A great thanks for introducing us to the uplift modeling technique, and for suggesting our thesis project to the CRM department at the retail company. We would also like to thank Elin Thiberg at the retail company, who supervised us when needed and gladly answered every question we had regarding the structure of the different data sets. Another person at the retail company who supported us and guided us in the right direction was Sara Grünewald, and for that we are truly grateful. Last but not least, we would like to send a great thank you to our examiner and supervisor, Professor Tatjana Pavlenko, for providing professional advice and for guiding us during our meetings.


Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Purpose and Goal
    1.3.1 Ethics
  1.4 Data
  1.5 Methodology
  1.6 Delimitations and Challenges
  1.7 Outline

2 Theoretical Background and Related Work

3 Data
  3.1 Markets and Campaigns
  3.2 Variables

4 Methods and Theory
  4.1 Data Pre-Processing
    4.1.1 Data Cleaning
    4.1.2 Variable Selection and Dimension Reduction
    4.1.3 Binning of Variables
  4.2 Uplift Modeling
    4.2.1 Subtraction of Two Models
    4.2.2 Modeling Uplift Directly
    4.2.3 Class Variable Transformation
  4.3 Classification and Prediction
    4.3.1 Logistic Regression
    4.3.2 Random Forests
    4.3.3 Neural Networks
    4.3.4 Cross Validation
  4.4 Evaluation
    4.4.1 ROC Curve
    4.4.2 Qini Curve
  4.5 Programming Environment of Choice

5 Experiments and Results
  5.1 Data Pre-Processing
    5.1.1 Data Cleaning
  5.2 Uplift Modeling and Classification
    5.2.1 Random Forests
    5.2.2 Logistic Regression
    5.2.3 Neural Networks
    5.2.4 Cutoff for Classification of Customers

6 Conclusions
  6.1 Discussion
  6.2 Future Work
  6.3 Final Words

References

1 Introduction

This thesis begins with a general introduction to the area of the degree project, presented in the following subsections.

1.1 Background

In retail and marketing, predictive modeling is a common tool for targeting individuals and evaluating their response when an action is taken. The action is normally referred to as a campaign or offer that is sent out to the customers, and the response to model is the likelihood that a specific customer will act on the offer. Put differently, in traditional response models the objective is to predict the conditional class probability P(Y = 1 | X = x), where the response Y ∈ {0, 1} reflects whether a customer responded positively to an action (i.e. made a purchase) or not (i.e. did not make a purchase). X = (X1, ..., Xp) are the quantitative and qualitative attributes of the customer and x is one observation. Using traditional response modeling, the resulting classifier can then be used to select which customers to target when sending out campaigns or offers for marketing purposes. In reality, this is not always the desirable approach, since the targeted customers are those who are most likely to react positively to the offer after the offer has been sent out. The solution is thus to use a second order approach known as uplift modeling. The original idea behind uplift modeling is to use two separate train and test sets, namely one train and test set containing a treatment group and one train and test set containing a control group. The customers in the treatment group are subject to an action whereas the customers in the control group are not. Uplift modeling thus aims at modeling the difference between the conditional class probabilities in the treatment and control groups, instead of just modeling one class probability:

P^T(Y = 1 | X = x) − P^C(Y = 1 | X = x),    (1)

where the superscript T denotes the treatment group and the superscript C denotes the control group. This method is called Subtraction of Two Models and is presented in Section 4.2.1. Each probability in (1) is estimated using the statistical machine learning methods presented in Section 1.5. If the result of (1) is negative, it indicates that the probability that a customer makes a purchase is larger when the customer belongs to the control group than when the customer belongs to the treatment group. This is called a negative effect, and it is very important to include it in the models in order to investigate how the campaigns affect the customers; see Section 4.4.2 for more details. There also exist other approaches to uplift modeling which model the uplift directly by using one data set instead of two. This data set includes both the treatment data and the control data, and is split into train and test sets. The methods for modeling the uplift directly either use a tree based method, Section 4.2.2, or a class variable transformation, Section 4.2.3.
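As an illustrative sketch only (not the thesis implementation), the Subtraction of Two Models approach in (1) can be written in a few lines: fit one probability model on the treatment group, one on the control group, and subtract the predicted class probabilities. The minimal gradient-descent logistic regression and the synthetic data below are assumptions made purely for illustration.

```python
import numpy as np

def fit_logreg(X, y, lr=0.5, steps=400):
    """Plain gradient-descent logistic regression; returns weights (bias first)."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

rng = np.random.default_rng(0)
n, p = 2000, 4
X_T = rng.normal(size=(n, p))                 # treatment-group attributes (synthetic)
X_C = rng.normal(size=(n, p))                 # control-group attributes (synthetic)
y_T = (rng.random(n) < 0.15).astype(float)    # purchases under treatment
y_C = (rng.random(n) < 0.10).astype(float)    # purchases without treatment

w_T = fit_logreg(X_T, y_T)                    # model for P^T(Y = 1 | x)
w_C = fit_logreg(X_C, y_C)                    # model for P^C(Y = 1 | x)

# Estimated uplift for new customers, as in equation (1).
X_new = rng.normal(size=(10, p))
uplift = predict_proba(w_T, X_new) - predict_proba(w_C, X_new)
print(uplift.shape)  # one uplift score per customer
```

Note that the two models are trained on disjoint data sets and never see each other's residuals, which is exactly the weakness discussed later: each model optimizes its own class fit rather than the difference.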

Using the uplift modeling approach, the true gain from targeting an individual can be modeled. The purpose of using uplift modeling is hence to optimize customer targeting when applying it in the marketing domain.

1.2 Problem

A problem arises when using uplift modeling, i.e. when using one treatment group and one control group: for every individual in the experiment, only one outcome can be observed. Either the individual belongs to the treatment group or the individual belongs to the control group; one individual can never belong to both groups. Put differently, it is not possible to know for sure that there is a causal connection, i.e. that a customer in the treatment group responds because of the treatment, since the same customer cannot be in the control group at the same time. Thus it is not possible to evaluate decisions at the level of the individual observational unit, as is possible in, for example, classification problems where the class of the individual is actually known. This in turn makes uplift models trickier to evaluate. Furthermore, uplift modeling has not yet been tested on the data used in this project, or on similar data belonging to the company that owns this data. Thus it is not clear whether it is even possible to apply the uplift modeling technique to this data and obtain applicable results. The question to be answered is hence how to optimize customer targeting in the marketing domain by using the uplift modeling approach, while at the same time being able to model the true gain from targeting one specific individual. Furthermore, how should the uplift modeling technique be implemented to obtain the most applicable results given this kind of data?

1.3 Purpose and Goal

The purpose of this thesis is to present methods for optimizing customer targeting in marketing campaigns in the area of retail. The thesis presents investigations and discussions of different statistical machine learning methods that can be used when the aim is to estimate (1). The goal of the degree project is to present the uplift modeling approach, in combination with a statistical machine learning method, that yields the best performance given the data used in this project. The result of the project should provide guidance on which approach is best suited, and thus is the best method of choice, for analyses that fall into the same category as those in this project.

1.3.1 Ethics

Today there exists a lot of data on the internet that provides powerful tools when it comes to marketing and other predictions of behavior and personality. Our goal with this project is to find the subgroup of customers that will give the best response to retail campaigns. This task can look quite harmless on its own. However, in recent years it has been shown that when similar techniques are used in other circumstances, they can have serious consequences. For example [9], the company Cambridge Analytica used the behavioural data of millions of people from Facebook, without their permission, for the 2016 election in the USA. The data was then used to build models to find persuadable voters who could be manipulated, without their knowledge, through fake information in ads on Facebook. This is obviously a serious threat to democracy and a new, effective way of spreading propaganda. The uplift modeling technique was also used in the Obama campaign in 2012 [20]. Using the technique for that campaign was acceptable at the time, since the data was not illegally collected; also, the results from the models were only used to choose which people to target with campaign commercials. Hence, it is important to question for what purposes it is ethically correct to use this technique. One also has to question whether the data that is used is acceptable to include in the models: is it legally collected, and would every person find it reasonable that their data is used for the purpose of the task? The laws regarding personal data become stricter over time, which means that companies cannot use people's data any way they want. This makes it easier to draw boundaries for what kind of data can be used when applying the uplift modeling technique, although it is still important to always question the purpose and understand the power of the technique.

1.4 Data

The data is collected from the retail company's database and includes qualitative and quantitative attributes of the customers. The data describes, among other things, the behaviour of different customers in terms of how many purchases have been made in different time periods, how many returns have been made, as well as the amounts that the customers have spent on online purchases and in stores. There is also one binary response variable that shows whether a customer has made a purchase during a campaign period or not. Each data set used in this project corresponds to one specific campaign, hence one customer can occur in several data sets. There is one variable that describes whether a customer belongs to the control group or the treatment group. Customers belonging to the control group are customers who did not receive any campaign offer, while customers belonging to the treatment group did receive the offer.
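For illustration only, a data set of this structure can be split on the binary group indicator and a first, naive uplift estimate computed as the difference in purchase rates between the two groups. The column names and the toy numbers below are assumptions, not the company's actual schema.

```python
import pandas as pd

# Toy campaign data set: one row per customer (illustrative values).
df = pd.DataFrame({
    "group": [1, 1, 0, 1, 0],       # 1 = treatment (received the offer), 0 = control
    "resp_flag": [1, 0, 0, 1, 1],   # 1 = purchase during the campaign period
    "age": [34, 51, 45, 29, 62],
})

treatment = df[df["group"] == 1]
control = df[df["group"] == 0]

# Average uplift over the whole group (not yet conditional on attributes).
print(treatment["resp_flag"].mean() - control["resp_flag"].mean())
```

This group-level difference is what uplift modeling refines: instead of one average, the models in later sections estimate the difference conditionally on each customer's attributes.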

1.5 Methodology

Using the uplift modeling approach assumes the use of a statistical machine learning method to predict the actions of individuals in the treatment group as well as individuals in the control group. There are three overall approaches to uplift modeling. The first one is known as Subtraction of Two Models, i.e. using (1) as it stands to model the difference between the class probabilities. The second approach is to model the uplift directly by using

a conditional divergence measure as the splitting criterion in a tree-based method. The third approach is to use a Class Variable Transformation, which allows an arbitrary probabilistic classification model to be converted into a model that predicts uplift directly. As there are advantages and disadvantages with each of the uplift modeling approaches, all of them will be examined in this project. Furthermore, each uplift modeling approach requires a suitable statistical machine learning method. For both the Class Variable Transformation and Subtraction of Two Models, it is possible to use almost any statistical machine learning method that can predict conditional class probabilities. Examples of such methods are Logistic Regression, Support Vector Machines, Multilayer Perceptrons (Neural Networks), tree-based methods and K-Nearest Neighbours. For the purpose of comparing the performance of a simple model to that of a more complex model, Logistic Regression and Multilayer Perceptrons will be used in these uplift modeling settings. Using a conditional divergence measure as a splitting criterion obviously requires a tree-based method. The method of choice for this approach in this project is thus Random Forests.
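A hedged sketch of the Class Variable Transformation may clarify the idea: assuming equally sized treatment and control groups (P(T) = 1/2), define the transformed target Z = T·Y + (1 − T)·(1 − Y), i.e. Z = 1 for treated responders and untreated non-responders; the uplift can then be expressed as 2·P(Z = 1 | x) − 1. The synthetic data and the minimal fitter below are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def fit_logreg(X, y, lr=0.5, steps=400):
    """Plain gradient-descent logistic regression; returns weights (bias first)."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(1)
n, p = 4000, 4
X = rng.normal(size=(n, p))
t = rng.integers(0, 2, size=n)                       # 1 = treated, 0 = control
y = (rng.random(n) < 0.10 + 0.05 * t).astype(int)    # treatment lifts purchase rate

# Class variable transformation: Z = 1 for treated responders
# and for control-group non-responders.
z = t * y + (1 - t) * (1 - y)

w = fit_logreg(X, z.astype(float))
Xb = np.hstack([np.ones((n, 1)), X])
p_z = 1.0 / (1.0 + np.exp(-Xb @ w))     # P(Z = 1 | x)
uplift = 2.0 * p_z - 1.0                # predicted uplift per customer
print(uplift.shape)
```

The appeal of this approach is that any probabilistic classifier can be trained once, on the single combined data set, rather than fitting and subtracting two separate models.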

1.6 Delimitations and Challenges

The first note that needs to be made is that in order to use uplift modeling, there must exist a treatment group and a control group related to a certain campaign or offer1. Not only must a control group exist, it also needs to be large enough for uplift modeling to be beneficial. The control group needs to be at least ten times larger than it would need to be for measuring simple incremental response. Also, when modeling binary outcomes, the treatment group and control group together need to be quite large. Another issue to take into consideration when performing uplift modeling is complex customer influences. If a customer interacts with a company in several ways, such as different marketing activities, advertisements, communications etc., it can be hard to isolate the effect of the specific marketing activity that is intended to be modeled, unlike when there are fewer kinds of interactions between the customer and the company. Lastly, uplift modeling models the difference between two outcomes rather than just one outcome. Radcliffe et al. [15] point out that this leads to a higher sensitivity to overfitting the data. Thus, even methods that are normally robust without variable selection and the like are in need of it before any uplift modeling can be done.

1.7 Outline

In Section 2, the idea behind uplift modeling is explained along with some related work that has already been done in the area. Section 3 contains a detailed description of the data, i.e. statistics of the different campaigns that are used and the types of variables that are collected

1Since this thesis applies uplift modeling to the area of retail, the action whose related response is modeled will be referred to as a campaign or offer throughout the thesis.

in the different data sets. The variables are listed in a table where no variable is excluded, meaning that the table contains all the variables that are used before any kind of variable selection is made. The description of the data is followed by Section 4, which contains all the theory related to this thesis. Here, a theoretical description of how to pre-process data is presented, along with some theory on variable selection. Furthermore, the three different approaches to uplift modeling, as well as the statistical machine learning methods that are used to perform uplift modeling, are described. The three uplift modeling approaches used in this thesis are Subtraction of Two Models, Modeling Uplift Directly and the Class Variable Transformation. The statistical machine learning methods used for uplift modeling are Logistic Regression, Random Forests and Neural Networks. Moreover, a description of the resampling method Cross Validation is presented. The evaluation metrics used in this project are also presented, namely Receiver Operating Characteristic curves and Qini curves. Finally, Section 4 ends with a description of the programming languages that are used for the different approaches and methods, and why these languages are well suited for this kind of problem. In Section 5, all the experimental results are presented. Firstly, the results from the pre-processing of the data are presented. Secondly, each implementation is described along with tables and figures of the results of the best performing models. The report ends with Section 6, in which the conclusions of the results are discussed.

2 Theoretical Background and Related Work

Machine learning is an area within computer science and statistics which often aims, given some attributes, to classify a specific instance into some category, or to estimate the conditional probability that it belongs to each of the classes. This technique can be used in many areas, one of them being marketing. In reality, though, this regular kind of classification technique is not really well suited for marketing. For instance, consider a marketing campaign where an offer is sent out to a (randomly) selected subgroup of potential customers. Using the recorded actions of these customers, a classifier can then be built, and the resulting classifier is used to select which customers to send the campaign to. The result will be that the customers who are most likely to react positively to the offer, after the campaign has been sent out, will be used as targets. This is not desirable for the marketer. Some customers would have made a purchase whether or not they were targeted with the campaign, and thus unnecessary expenses are incurred when sending the offer to this kind of customer. Then there are customers who actually react negatively to receiving a campaign offer. Some might find it disturbing to receive campaign offers from the company in question, or stop being a customer for some other reason just because they received the offer. When a customer stops doing business with a company it is called customer churn, something the company in question really wants to avoid. In other words, this is not a customer the marketer wants to target, since sending out the campaign in this case is an unnecessary expense and the company needlessly loses a customer. The first kind of customer just described is called a Sure Thing and the second one is commonly referred to as a Do-Not-Disturb. Then there are two more categories of customers, namely the Lost Cause and the Persuadable.
As the name suggests, the Lost Cause is someone who would not make any purchase at all, whether targeted or not. Reaching out to this kind of customer is also a waste of money. The Persuadable, on the other hand, is the customer that the marketer wants to find and target. This is a person who would not have made any purchase without the campaign offer, but who will make a purchase when receiving it. These are the customers the marketer can affect in a positive direction. An overview of the different types of customers can be seen in Table 2.1. The solution to this kind of problem is called Uplift Modeling. The original idea behind uplift modeling is to use two separate training sets, namely one data set containing the control group and one containing the treatment group. The control group contains the customers who were not targeted by the campaign, and the treatment group contains the customers who received the campaign. Uplift modeling thus aims at modeling the difference between the conditional class probabilities in the treatment and control groups, instead of just modeling one class probability. In this way, the true gain from targeting an individual can be modeled. A more detailed and theoretical description of uplift modeling is given in Section 4.2. According to [19], uplift modeling is already applied frequently in the marketing domain, although it has not received as much attention in the literature as one might believe.

                          Response if not treated
                          Yes                No
 Response     No          Do-Not-Disturb     Lost Cause
 if treated   Yes         Sure Thing         Persuadable

Table 2.1: The four categories of individuals considered when applying the uplift modeling technique.

In an article about uplift modeling in direct marketing, Rzepakowski et al. [17] use decision trees to model the uplift for e-mail campaigns. The decision tree based models are also compared to simpler standard response based models, such that three uplift models and three standard response models are used in total. The data that is modeled reflects the customers of a retail company. The goal is thereby to classify customers as persuadable, where the response reflects whether or not they go to the retail company's website because of the campaign. The result of the study is that they find it possible, and more effective, to use uplift modeling rather than response models to predict the persuadables, i.e. the customers who respond positively to the campaigns. The standard response models were good at predicting whether a customer would go to the website or not, but performed very badly in predicting whether they responded to the campaign or not. Rzepakowski et al. also show that uplift modeling done with decision trees (Modeling Uplift Directly) yields better results than the Subtraction of Two Models. This is the reason why this project focuses solely on comparing different approaches to uplift modeling, and does not include traditional response or purchase models, since they have in many cases been shown to perform worse. The same subject is discussed by Radcliffe et al. [15], who write about uplift modeling and why it performs better than other traditional response methods. They also discuss thoroughly many important aspects such as the evaluation of uplift models and variable selection for uplift modeling, which is very helpful for gaining deeper insights into the matter. They also write about the Subtraction of Two Models and why this approach does not work well compared to Modeling Uplift Directly. Radcliffe et al. indicate that it is important to understand that even if the Subtraction of Two Models is capable of building two good separate models that perform well on unseen data, this does not necessarily yield a good uplift estimate when taking the difference of the two models.

3 Data

The data used in all the statistical machine learning methods in this project is collected from a well established retail company that has physical stores as well as a website where customers can place orders online. Customer behaviour can vary a lot when it comes to how a purchase is made. Some customers only place orders online, while some might only shop in a physical store; furthermore, there are customers who make purchases both online and in stores. The data used in this project considers all kinds of purchases (both store and online). In the following subsections, the markets and campaigns used in this thesis are presented along with a table describing all the variables.

3.1 Markets and Campaigns

The customer base can be segmented into different categories depending on the customers' purchase behaviour, and the company works actively with encouraging frequent customers to make more purchases. For the purpose of not losing a frequent customer to a silent stage, the data used in this thesis only considers campaigns sent to frequent customers. The focus is on this category of customers since the wish is to use uplift modeling so that campaigns will mainly be sent to customers of the type Persuadable. Also, the campaigns differ depending on the stage of the customer, and thus by focusing on frequent customers there is consistency in what kind of campaign is used in the methods of this project. Today the company is present in more than 70 retail markets. One specific market is chosen for this project, and thus all the data used in the uplift models is generated from this market. The campaigns in question that are sent to the customers are actual postcards. These postcards are sent to the customers' mailboxes and each postcard is only valid for one online purchase. All the campaigns contain an offer of a 10% discount on one purchase at the company's webshop. The only thing that differs between the campaigns is the time period in which they were sent out. In this thesis, six different campaigns that were sent out to customers in the chosen market are considered. The campaigns and their start dates, along with other information, can be seen in Table 3.1.

3.2 Variables

Campaign   CampaignStartDate   AddressFileDate   N         T        C
1          2017-10-30          2017-09-25        181 221   97.24%   2.67%
2          2018-02-05          2018-01-18         82 828   90.34%   9.66%
3          2018-02-12          2018-01-18        155 096   90.34%   9.66%
4          2018-04-02          2018-03-05         62 121   90.34%   9.66%
5          2018-06-25          2018-05-31        310 607   90.34%   9.66%
6          2019-03-04          2019-02-18        207 071   90.34%   9.66%

Table 3.1: The six different campaigns used in the uplift models. AddressFileDate is the date the customers were chosen to be part of the campaign and N is the total number of customers in each data set. T and C are the percentages of customers that belong to the treatment group and the control group, respectively.

Following is a table of all the variables in the data set before variable selection is made. Each row of the data set corresponds to a different customer and each column contains one of the variables. If a variable is not specified to concern online purchases only, it concerns purchases made both online and in stores. Note that the data concerning customers who have made a purchase in a store covers customers who are also club members (or staff), because the company is not able to collect data about store customers who do not have a membership and are not staff. Purchases made online, on the other hand, can concern members and non-members as well as staff. All amounts are in EUR, at the most recent currency rate. The gross amount is the amount a piece is set to cost before any kind of reduction or discount is made. If a reduction is made, i.e. if a specific item is on sale or has a new reduced price or equivalent, the new price is the brutto amount. Discount implies that a customer has a personal discount that is used on a purchase, i.e. it is not the same as an overall reduction applied to an item, but a discount used by a specific customer. The final amount paid by the customer is then the net amount.

Variable Description 0 if customer belongs to the control group, 1 if treatment group group. Gender of customer, 1 if female, 0 if male and set to gender missing if unknown. age Age of the customer. Response flag which is 1 if customer has made a purchase resp_dd_flag online within the response window2, 0 otherwise. Number of pieces ordered online in total during response resp_dd_pieces window. resp_dd_price_net Total net price on online orders during response window. Club membership which is 1 if customer was a member at Clubmember address file date3, 0 otherwise. IsStaff 1 if customer is staff, 0 otherwise. lastP urchaseDate Date of the latest purchase in the observation period.

2The response window is from the day the campaign started plus 14 days. 3Address file date is the date the customer was chosen to be a part of the campaign.

9 1 if customer has made a purchase within the last i = Has_P urch_i 3, 12, 244 months before address file date, 0 otherwise. 1 if customer has made a purchase from children within Has_P urch_Child_i the last i = 3, 12, 24 months before address file date, 0 otherwise. 1 if customer has made a purchase from ladies within Has_P urch_Ladies_i the last i = 3, 12, 24 months before address file date, 0 otherwise. 1 if customer has made a purchase from ladies accessories Has_P urch_LadiesAcc_i within the last i = 3, 12, 24 months before address file date, 0 otherwise. 1 if customer has made a purchase from men within Has_P urch_Men_i the last i = 3, 12, 24 months before address file date, 0 otherwise. Number of orders the past i = 3, 12, 24 months before orders_i address file date. Number of orders with reduction or discount the past orders_red_or_dis_i i = 3, 12, 24 months before address file date. Number of returned orders the past i = 3, 12, 24 months orders_ret_i before address file date. Share of orders with reduction or discount the past i = share_red_or_dis_order_i 3, 12, 24 months before address file date. Share of orders with returned pieces the past i = 3, 12, 24 share_ret_order_i months before address file date. Number of pieces in total the past i = 3, 12, 24 months dd_pcs_i before address file date. Net amount in total the past i = 3, 12, 24 months before dd_net_amt_i address file date. Number of pieces with reduction or discount in total the dd_red_or_dis_pcs_i past i = 3, 12, 24 months before address file date. Number of returned pieces the past i = 3, 12, 24 months dd_ret_pcs_i before address file date. Returned net amount the past i = 3, 12, 24 months before dd_ret_net_amt_i address file date.

Table 3.2: Table of all the variables used in the data set, as well as the description for each variable. There are 54 variables in total in each data set.

⁴ i = 3, 12, 24 indicates that there are three different variables with the same kind of information, but for 3, 12 and 24 months.

4 Methods and Theory

Uplift modeling is a predictive modeling technique that directly models the incremental impact of a treatment on an individual’s behaviour. This will be the underlying model for constructing the statistical machine learning methods. There is a great number of statistical machine learning methods that can be used for regression or classification. In this case, when using a statistical machine learning method with the purpose to apply it in an uplift modeling setting, suitable models are Logistic Regression, Random Forests and Multilayer Perceptrons (Neural Networks) as these perform binary classification. The following sections hence include the theoretical background for data pre-processing, uplift modeling, classification and evaluation metrics. Last but not least, the different programming environments of choice are presented along with some arguments for their compatibility with the data and statistical machine learning methods used in this project.

In this project the input variables are denoted as Xm ∈ {X1, ..., Xp}, also called input ”nodes” for Neural Networks, where p is the number of attributes in the data and m is an index corresponding to one variable. The response variable is denoted as Y and a prediction is denoted Ỹ, which takes on values within [0, 1]. The values represent the probability that an observation belongs to a certain class (0 or 1). A vector with all the variables is defined as X = (X1, ..., Xp), where an observation x_i is a column vector of p elements. Furthermore, a matrix with N observations and p variables is denoted with a bold letter X ∈ R^{N×p} and the response vector is denoted y = (y1, ..., yN). One observation of X is then the row vector x_i^T = (x_{i,1}, ..., x_{i,p}) where i = 1, ..., N.

4.1 Data Pre-Processing

The data produced nowadays is large in size and usually has a very high dimension. The data also most likely includes a lot of errors such as missing values and outliers. Pre-processing data is about removing and manipulating these values so that the data is a good representation of the desired objects. A part of the process may also include dimension reduction when needed. The management of data can be a very challenging task since manual pre-processing of data takes a lot of time, see [8]. Moreover, the variables in the data can be in very different ranges and can have different amounts of impact on the prediction. When making a predictive analysis (and other analyses as well), it is of high importance to have a data set with good representation and quality to get an acceptable result. It is also important to choose the variables that are best associated with the response and to avoid too high a dimension of the data. To achieve this, data pre-processing is made in several ways to form a good data representation.

4.1.1 Data Cleaning

Cleaning the raw data is a crucial step in order to get data representations of good quality. It is important to identify and remove incorrect and incomplete data and also, if needed, to replace and modify bad data points. In the following subsections, different ways to handle missing values and outliers will be presented.

Missing Values

It is very common that some features in a data set have missing values, and thus it is of high importance to handle the missing data somehow. Deleting the columns or rows that have a missing value is one way to handle it, but depending on what kind of missing value it is, there exist other techniques that might be more suitable. Overall, missing values can be divided into three different categories according to [5], namely missing at random (MAR), missing completely at random (MCAR) and missing not at random (NMAR). The missing data is MAR if, for example, respondents in a certain profession are less likely to report their income in a survey. The missing value thus depends on other variables than the one that is missing. If the data is said to be MCAR, then the missing value does not depend on the rest of the data. This can for example be the case if some questionnaires in a survey accidentally get deleted. If the missing data depends on the variable that is missing, the data is said to be NMAR. An example of this can be if respondents with high income are less likely to report their income in a survey. Having this kind of missing data causes the observed training data to give a corrupted picture of the true population, and imputation methods are under these conditions dangerous. It is possible to use imputation methods both on data that is assumed to be MAR and on data assumed to be MCAR, although MCAR is a stronger assumption. Whether or not the data is MCAR often needs to be determined in the process of collecting the data. As mentioned before, there are several ways to handle the missing data and the simplest one is to delete the observations that contain the missing data. This method is usually called the listwise-deletion method and it is only workable if the proportion of deleted observations is small relative to the entire data set. Furthermore, it can only be used under the assumption that the missing values are MAR or MCAR.
On the other hand, if the amount of missing data is large relative to the entire data set, the method just mentioned is not good enough. In such cases it is possible to fill in an estimated value for each missing value by using a Single Imputation method such as Mean Imputation, which means that the missing value is replaced with the mean of all the completely recorded values for that variable. Another way to handle missing values is to use a more sophisticated algorithm such as the EM algorithm or Multiple Imputation. The latter fills in the missing values m > 1 times and thus creates m different data sets which are analyzed separately; the m results are then combined to estimate the model parameters, standard errors and confidence intervals. Each time the values are imputed they are generated from a distribution that might be different for each missing value, see [6].
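As a concrete illustration of Mean Imputation, a minimal sketch could look as follows (the encoding of missing values as None is a hypothetical choice for the example):

```python
def mean_impute(column):
    """Single (Mean) Imputation: replace each missing entry with the mean
    of the observed values of the same variable."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

# The missing second entry is replaced by the mean of 1.0 and 3.0.
print(mean_impute([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```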

In this project, the statistical software suite SAS is used for all the pre-processing of the data and hence the existing procedure MI is used for handling some of the missing values. The MI procedure is a multiple imputation method that has a few different statements to choose from depending on what type of variables need to be imputed. The FCS statement is used in this project along with the imputation methods LOGISTIC (used for binary classification variables) and REG (used for the continuous variables). FCS stands for Fully Conditional Specification and the statement determines how the variables with an arbitrary missing data pattern are imputed; the methods LOGISTIC and REG are two of the available methods related to this statement. The procedure yields m separately imputed data sets with appropriate variability across the m imputations. These imputed data sets then need to be analyzed using a standard SAS procedure, which in this project is the MIXED procedure, since it is valid for a mixture of binary and continuous variables. Once the analyses from the m imputed data sets are obtained, they are combined in the MIANALYZE procedure to derive valid inferences. This procedure is described in detail in [18].

Outliers

Another important step in the process of cleaning the data is the handling of outliers. An outlier is an observation that is located at an abnormal distance from the rest of the data, i.e. the observation does not seem to fit the other data values. What is an abnormal distance can be decided by for example comparing the mean or median of the data with similar historical data sets, see [6]. The handling of outliers is not necessary for all kinds of statistical machine learning methods as some methods are immune to the existence of outliers. In this project, the detection and handling of outliers are done for Logistic Regression and Neural Networks as these methods are sensitive to the existence of predictor outliers. Decision trees are immune to outliers and thus outlier detection is not done as a part of the data pre-processing step for Random Forests. The method used for dealing with outliers in this project is Hidden Extrapolation, which can be used in multivariate regression cases. The idea is to define a convex set called the regressor variable hull (RVH). If an observation is outside this set, it can be confirmed to be an outlier. In Figure 4.1 it can be seen, for a two-variable case, that the point (x01, x02) lies within the range of the variables X1 and X2 but not within the convex area. Hence, this observation is an outlier of the data set that is used to fit the model. To determine the RVH, let us define the hat matrix

H = X(X^T X)^{-1} X^T   (2)

where X is the N × p matrix with the data set that is used to fit the model. The diagonal elements h_ii of the hat matrix can be used to determine if an observation is an outlier or not. h_ii depends on the Euclidean distance between observation x_i and the centroid, and also on the density of observations in the RVH. The value h_ii that lies on the boundary of the RVH is called h_max and it is the largest of all the diagonal elements. If an observation

Figure 4.1: A visualisation of the idea behind hidden extrapolation. The gray area is the ellipsoid that includes all observations of the RVH. The figure is taken from [11].

xi satisfies:

x_i^T (X^T X)^{-1} x_i ≤ h_max   (3)

that observation lies within the ellipsoid that consists of all the observations in the RVH. For example, to determine whether an observation x_0 is an outlier or not, h_00 can simply be calculated and checked against h_max. If the following holds:

h_00 = x_0^T (X^T X)^{-1} x_0 ≤ h_max

the observation is not an outlier since it lies within the RVH, see [11].
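The check above can be sketched for the two-variable case, where (X^T X)^{-1} has a simple closed form (the fitting data below are made up for illustration):

```python
def hat_value(x, xtx_inv):
    """h = x^T (X^T X)^{-1} x for one observation x = (x1, x2)."""
    (a, b), (c, d) = xtx_inv
    x1, x2 = x
    return x1 * (a * x1 + b * x2) + x2 * (c * x1 + d * x2)

def is_outlier(x0, rows):
    """True if x0 lies outside the RVH of the fitting data, i.e. h00 > h_max."""
    # X^T X and its closed-form inverse for p = 2
    s11 = sum(r[0] * r[0] for r in rows)
    s12 = sum(r[0] * r[1] for r in rows)
    s22 = sum(r[1] * r[1] for r in rows)
    det = s11 * s22 - s12 * s12
    xtx_inv = ((s22 / det, -s12 / det), (-s12 / det, s11 / det))
    h_max = max(hat_value(r, xtx_inv) for r in rows)
    return hat_value(x0, xtx_inv) > h_max

X_fit = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(is_outlier((1.0, 1.0), X_fit))    # False: inside the RVH
print(is_outlier((10.0, -5.0), X_fit))  # True: hidden extrapolation
```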

4.1.2 Variable Selection and Dimension Reduction

Selection of variables is an important part of the data pre-processing step. Statistical machine learning methods are used to find relationships between the response variable and the input variables in the form of a function Y = f(X) + ϵ where ϵ is an error term. If there are too many variables compared to the amount of training data, it is hard for the model to find the underlying function and it gets overfitted. If the final model only includes variables that are truly associated with the response, adding them improves the model accuracy. In reality this is usually not the case since most variables are noisy and not completely associated with the response. Adding many noisy variables to the model deteriorates it, and it will as a consequence perform worse on unseen data. Some statistical machine learning methods, like decision-tree learners, perform variable selection as a part of the modeling process and are thus often not in need of variable selection. However, for statistical machine learning methods used in an uplift modeling setting, variable selection needs to be done since the difference between two outcomes is modeled and, in many cases, the uplift is small relative to the direct outcomes, which heavily increases the risk of overfitting the data according to [15].

Net Information Value

A common technique for variable selection when performing uplift modeling, i.e. (1), is called the Net Information Value, NIV, which is demonstrated in [15]. The method ranks the variables and is used for every method in this project. The NIV is formed from the Weight of Evidence, WOE. Each continuous and categorical predictor is split into bins i, where i = 1, ..., G. G is the number of bins created for continuous predictors or the number of categories for categorical predictors. The predictors are thus turned into discrete predictors. For each bin i, the WOE is defined as

WOE_i = ln( P(X_m = i | Y = 1) / P(X_m = i | Y = 0) )

where Y ∈ {0, 1} is the label that tells whether a customer made a purchase or not and X_m is one predictor from the vector X = (X1, ..., Xp) with index m. Further, the Net Weight of Evidence NWOE_i is defined as

NWOE_i = WOE_i^T − WOE_i^C

where T again denotes the treatment group and C the control group. Using NWOE_i, the NIV for each variable in the data set can be calculated using

NIV = Σ_{i=1}^{G} NWOE_i · ( P^T(X_m = i | Y = 1) · P^C(X_m = i | Y = 0) − P^T(X_m = i | Y = 0) · P^C(X_m = i | Y = 1) )

The uplift package [3] in R calculates the NIV in the following way:

Algorithm 1 Net Information Value in the uplift package [3].

1. Take B bootstrap samples and compute the NIV for each variable on each sample according to:

NIV = 100 · Σ_{i=1}^{G} NWOE_i · ( P^T(X_m = i | Y = 1) · P^C(X_m = i | Y = 0) − P^T(X_m = i | Y = 0) · P^C(X_m = i | Y = 1) )

2. Compute the average of the NIV (µ_NIV) and the sample standard deviation of the NIV (σ_NIV) for each variable over all the B bootstrap samples.

3. The adjusted NIV for a given variable is computed by subtracting a penalty term from µ_NIV:

NIV = µ_NIV − σ_NIV / √B

If a variable has a high NIV it can be considered to be a good predictor: the higher the NIV is for a variable, the better a predictor it can be considered to be.
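To make the formula concrete, the (unpenalized) NIV can be computed from per-bin probabilities as below, where pt1[i] stands for P^T(X_m = i | Y = 1) and so on; this is a sketch only, and the bootstrap averaging of Algorithm 1 is omitted:

```python
import math

def net_information_value(pt1, pt0, pc1, pc0):
    """NIV for one predictor with G bins; the four lists hold the per-bin
    conditional probabilities in the treatment (pt*) and control (pc*) groups."""
    niv = 0.0
    for i in range(len(pt1)):
        # NWOE_i = WOE_i^T - WOE_i^C
        nwoe = math.log(pt1[i] / pt0[i]) - math.log(pc1[i] / pc0[i])
        niv += nwoe * (pt1[i] * pc0[i] - pt0[i] * pc1[i])
    return niv

# Identical treatment and control distributions give NIV = 0,
# i.e. the predictor carries no uplift information.
print(net_information_value([0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]))
```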

Variable Selection using Random Forests

Random Forests performs variable selection as a part of the modeling process and can thus be used to evaluate the Variable Importance (VI) in a data set. Random Forests is an ensemble learning method that works by constructing multiple decision trees during training and outputting the most commonly occurring class among the different predictions (in classification settings) or the mean prediction (in regression settings). Using decision trees, one aims at creating a model that predicts the label/target using some input variables. A decision tree consists of a tree structure with one root node which is split into two daughter nodes, where node m represents the corresponding region R_m. The process is then repeated for all the new regions. The splitting is based on a splitting criterion defined on the input variables. Put differently, the variable chosen at each step is the one that splits the region in the best manner. Using the so-called Gini index in a classification tree, it is possible to get an overall summary of the VI, which is an output of the Random Forests algorithm and which shows the variables that have been chosen at each split. The Gini index is thus used to evaluate the quality of each split and is defined in [5] in the following way, for each node m:

G_m = Σ_{k≠k'} p̂_{mk} p̂_{mk'} = Σ_{k=1}^{2} p̂_{mk}(1 − p̂_{mk})

where p̂_{mk} is the proportion of the training observations from the kth class in the mth region. As the target only has two outcomes in this project, i.e. Y ∈ {0, 1}, there are only two classes k. The proportion p̂_{mk} is defined as:

p̂_{mk} = (1 / N_m) Σ_{x_i ∈ R_m} I(y_i = k)

where y_i is one response observation and x_i is one vector corresponding to one observation in the region R_m. The node m represents a region R_m with N_m observations and an observation in node m is classified according to the majority class in node m:

k(m) = arg max pˆmk k

A large VI value indicates that the variable is an important predictor, and thus it is possible to rank the variables accordingly when VI is measured using Random Forests. In this project, this method is used separately to rank the variables according to the VI and the best ranked variables are then used as input to the Random Forests method that performs uplift.
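A minimal sketch of the Gini index for one node in the binary setting:

```python
def gini_index(labels):
    """Gini index G_m = sum_k p_mk (1 - p_mk) for the 0/1 labels in node m."""
    p1 = sum(labels) / len(labels)   # proportion of class 1 in the node
    p0 = 1.0 - p1
    return p0 * (1 - p0) + p1 * (1 - p1)

print(gini_index([0, 0, 0, 0]))  # 0.0: a pure node
print(gini_index([0, 1]))        # 0.5: maximally mixed node
```

A split is good when it produces daughter nodes with low Gini index, i.e. nodes that are close to pure.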

Dimension Reduction using Principal Component Analysis (PCA)

Dimension reduction is made using Principal Component Analysis (PCA) for both Neural Networks and Logistic Regression on non-binary variables. PCA reduces the dimension of the data into principal components in the directions where the variance is maximized. The resulting components are orthogonal, i.e. they are mutually uncorrelated. The following theory is taken from [21]. Suppose the original data matrix is given by X with p variables and N observations, i.e. X ∈ R^{N×p}. A one-dimensional projection of the data, Xα with N elements, can be made using any unit-norm vector α ∈ R^{p×1}. The sample variance of that projection is given by equation (4), assuming the variables of X are centered, where x_1, ..., x_N ∈ R^{p×1} are the observations of X.

V̂ar(Xα) = (1/N) Σ_{i=1}^{N} (x_i^T α)^2   (4)

The direction of the maximum sample variance, also called a loading vector, is given by v1 in equation (5) where (XT X)/N is the sample covariance.

v_1 = arg max_{||α||_2 = 1} V̂ar(Xα) = arg max_{||α||_2 = 1} { α^T (X^T X / N) α }   (5)

The loading vector v_1 is the eigenvector corresponding to the largest eigenvalue of the sample covariance and it gives the first principal component z_1 = Xv_1. The next principal component is generated by calculating another vector v_2 using (5) that is uncorrelated with v_1. This is repeated r times and it generates the following optimization problem, where the matrix V_r consists of all the optimal loading vectors.

V_r = arg max_{A : A^T A = I_r} trace(A^T X^T X A)   (6)

The matrix A consists of the unit-norm vectors α that optimize the problem, and trace(·) is the sum of the diagonal elements of the resulting matrix A^T X^T X A. V_r also maximizes the total variance of the resulting components even though the loading vectors are defined sequentially.
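As an illustration, the first loading vector v_1 can be found by power iteration on X^T X; the hypothetical sketch below is limited to p = 2 centered data, whereas library routines would use a full eigen- or singular value decomposition:

```python
def first_loading_vector(X, iters=200):
    """First PCA loading vector v1 via power iteration on X^T X (p = 2).
    Power iteration converges to the eigenvector with the largest
    eigenvalue, i.e. the direction of maximum sample variance."""
    s11 = sum(r[0] * r[0] for r in X)
    s12 = sum(r[0] * r[1] for r in X)
    s22 = sum(r[1] * r[1] for r in X)
    v = (1.0, 0.0)
    for _ in range(iters):
        w = (s11 * v[0] + s12 * v[1], s12 * v[0] + s22 * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    return v

# Centered toy data whose variance lies mostly along the first axis.
X_c = [(-2.0, 0.0), (2.0, 0.0), (-1.0, 0.1), (1.0, -0.1)]
v1 = first_loading_vector(X_c)
print(v1)  # close to (1, 0), up to sign
```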

4.1.3 Binning of Variables

Binning of variables is the procedure of converting continuous variables into discrete variables. Usually, discretization of a continuous variable can cause the variable to lose some information. In this project, binning is nevertheless implemented for some variables since there is an advantage of doing so for linearly dependent variables. Since SAS is used for all the pre-processing of data in this project, the HPBIN procedure will be used for the purpose of binning some variables, see [18] for more details. This procedure simply creates a data set in which the binned variables get saved. The procedure has an

option called numbin, which is used to decide the number of bins, i.e. the number of categories that the variables are discretized into. There exist several different binning methods and in this project the binning is done using bucket binning. Bucket binning means that evenly spaced cut points are used in the binning process. For example, if the number of bins is 3 and the continuous variable is in the range [0, 1], the cut points are then 0.33, 0.67 and 1. Thus, the resulting discrete variable takes on values in the range [1, 3].
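The idea of bucket binning can be sketched as follows; this is a simplified stand-in for the HPBIN procedure, not its actual implementation:

```python
def bucket_bin(values, lo, hi, numbin):
    """Bucket binning: evenly spaced cut points over [lo, hi].
    A value in bin i (1-based) satisfies lo + (i-1)*w <= value < lo + i*w,
    with bucket width w = (hi - lo) / numbin; the upper endpoint hi is
    clamped into the last bin."""
    width = (hi - lo) / numbin
    bins = []
    for v in values:
        i = int((v - lo) / width) + 1
        bins.append(min(i, numbin))
    return bins

# Three bins over [0, 1]: cut points at 1/3 and 2/3.
print(bucket_bin([0.0, 0.5, 1.0], 0.0, 1.0, 3))  # [1, 2, 3]
```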

4.2 Uplift Modeling

In this section, the problem formulation of the uplift modeling problem will be introduced, and three common approaches to the uplift problem will be discussed. To distinguish between the treatment group and the control group, notations with the superscript T will denote quantities related to the treatment group, while notations with the superscript C will denote quantities related to the control group. As an example, the probabilities in the treatment group will be denoted P^T and likewise, the probabilities in the control group will be denoted P^C. In addition, the notation M^U will denote the resulting uplift model. The response variable takes on values as Y ∈ {0, 1} where 1 corresponds to a positive response to the treatment while 0 corresponds to a negative response. Put differently, 1 means that the individual has made a purchase while 0 means that the individual has not made a purchase. The input attributes are the same for both models, i.e. for both the model containing the treatment data and the model containing the control data. The expected uplift is defined as the difference between the success probabilities in the treatment and control groups according to equation (1), i.e. the uplift is caused by taking the action conditional on X = (X1, ..., Xp). If the result is negative, it indicates that the probability that a customer makes a purchase when belonging to the control group is larger than when the customer belongs to the treatment group. This is called a negative effect and it is very important to include it in the models to be able to investigate how the campaigns are affecting the customers, see Section 4.4.2 for more details. Whether uplift modeling is an instance of a classification or regression problem is not fully clear as it can be treated as both. Uplift modeling can be viewed as a regression task when the conditional net gain (1) is treated as a numerical quantity to be measured.
It can also be viewed as a classification task as the class to predict is whether a specific individual will respond positively to an action or not. Thus, if the expected uplift is greater than zero for a given individual, the action should be taken. However, as mentioned earlier it is not possible to evaluate the uplift model correctness on an individual level, see [19]. For simplicity, uplift modeling will be referred to as a classifier throughout this thesis.

4.2.1 Subtraction of Two Models

When creating the algorithms for estimating equation (1) described in the introduction, there are three overall approaches that are commonly used. The first approach consists in building two separate classification models, one for the data in the treatment group, P^T, and one for the data in the control group, P^C. The uplift model approach Subtraction of Two Models can hence be defined as

M^U = P^T(Y = 1 | X = x) − P^C(Y = 1 | X = x)

which means that for each classified object, the class probabilities predicted by the model containing the data of the control group are subtracted from the class probabilities predicted by the model containing the data of the treatment group. This way, the difference in the class probabilities caused by the treatment is estimated directly (demonstrated in [7]). The input X = (X1, ..., Xp) is the same for both models but originates from two different data sets. This means that the model parameters in P^T will differ from the model parameters in P^C. The advantage of this approach is that it can be applied using any classification model and it is easy to estimate the uplift. The disadvantage is that this approach does not always work well in practice, since the difference between two independently accurate models does not necessarily lead to an accurate model itself, see [4]. Put differently, the risk is that each model focuses too much on modeling the class in both data sets separately, instead of modeling the difference between the two class probabilities. Also, the variation in the difference between the class probabilities is usually much smaller than the variability in the class probabilities themselves, which in turn can lead to an even worse accuracy, see [17]. Despite the disadvantages just mentioned, there are some cases when this approach is competitive. According to Sołtys et al. [19], this can be either when the uplift is correlated with the class variable (e.g. when individuals that are likely to make a purchase also are likely to respond positively to an offer related to the purchase), or when the amount of training data is large enough to make a proper estimation of the conditional class probabilities in both groups.
Since this approach can be applied with any classification model, and for the purpose of having a simple approach to compare with when investigating more advanced approaches, this approach will be implemented using both Logistic Regression and Neural Networks. Logistic Regression is a linear statistical machine learning method that is easy to implement, while the more complex Multilayer Perceptron (MLP) is a class of feedforward artificial Neural Networks. By implementing both of these methods, it is possible to analyze whether the simpler method Logistic Regression performs better or worse than the more complex Neural Networks.
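The approach can be sketched with a deliberately simple stand-in classifier: each group's "model" below is just the empirical purchase rate per value of one categorical feature, whereas in this project the two models are Logistic Regressions or Neural Networks:

```python
def fit_rates(rows):
    """Estimate P(Y = 1 | X = x) as the empirical purchase rate per value
    of a single categorical feature; a stand-in for any probabilistic
    classifier. rows: (x, y) pairs with y in {0, 1}."""
    counts, ones = {}, {}
    for x, y in rows:
        counts[x] = counts.get(x, 0) + 1
        ones[x] = ones.get(x, 0) + y
    return {x: ones[x] / counts[x] for x in counts}

def two_model_uplift(treatment_rows, control_rows, x):
    """Subtraction of Two Models: M^U = P^T(Y=1|x) - P^C(Y=1|x)."""
    return fit_rates(treatment_rows)[x] - fit_rates(control_rows)[x]

treat = [("a", 1), ("a", 1), ("a", 0), ("a", 1)]  # 75 % purchase rate
ctrl = [("a", 0), ("a", 1), ("a", 0), ("a", 0)]   # 25 % purchase rate
print(two_model_uplift(treat, ctrl, "a"))         # 0.5
```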

4.2.2 Modeling Uplift Directly

The second approach that is commonly used for uplift modeling is to model the uplift directly by modifying existing statistical machine learning algorithms, see [19]. The drawback of this

approach is hence the need for modification, since the model of choice needs to be adapted to differentiate between samples belonging to the control and treatment groups. The advantage, on the other hand, is the possibility to optimize the estimation of the uplift directly. Decision trees are well suited for modeling uplift directly because of the nature of the splitting criteria in the trees. A splitting criterion is used to select the tests in nonleaf nodes of the tree. To maximize the differences between the class distributions in the control and treatment data sets, Rzepakowski et al. [16] propose that the splitting criterion should be based on conditional distribution divergences, which measure how two probability distributions differ. Put differently, using this approach, at each level of the tree the test is selected so that the divergence between the class distributions in the treatment group and control group is maximized after a split has been made. The divergence measure used for this project is the squared Euclidean distance. Given the probabilities P = {p1, p2} and Q = {q1, q2}, the divergence is defined as

E(P, Q) = Σ_{k=1}^{2} (p_k − q_k)^2

where k is equal to 1 and 2 for binary classification as in this project, i.e. the response has two classes Y ∈ {0, 1}. In this case, p_1 and p_2 are equal to the treatment probabilities P^T(Y = 0) and P^T(Y = 1), and q_1 and q_2 are equal to the control probabilities P^C(Y = 0) and P^C(Y = 1). For any divergence measure D, the proposed splitting criterion is defined in (7) and the largest value of D_gain decides the split of that node.

D_gain = D_after_split( P^T(Y), P^C(Y) ) − D_before_split( P^T(Y), P^C(Y) )   (7)

P^T and P^C are the class probabilities in the treatment and control group before and after the split. The resulting divergence measure after a split has been made is defined as:

D_after_split( P^T(Y), P^C(Y) ) = Σ_{a=a_1}^{a_2} (N_a / N) D( P^T(Y | a), P^C(Y | a) )   (8)

where N is the number of observations before the split has been made, a ∈ {a_1, a_2} denotes the left and right leaf of that split and N_a is the number of observations in each leaf after the split has been made. E.g. if the split is made on a binary variable A ∈ {0, 1}, the left leaf a_1 corresponds to A = 0 and the right leaf a_2 corresponds to A = 1 in (8). This uplift modeling approach will be implemented using decision tree learners, which in this project is chosen to be the ensemble learning method Random Forests.
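The split-selection computation of (7) and (8) can be sketched for one candidate binary split; the class counts [n0, n1] below are made up, and for simplicity the leaf weights N_a/N are computed from the treatment-group counts only:

```python
def euclid(p, q):
    """Squared Euclidean divergence E(P, Q) = sum_k (p_k - q_k)^2."""
    return sum((pk - qk) ** 2 for pk, qk in zip(p, q))

def dist(counts):
    """Class distribution [P(Y=0), P(Y=1)] from counts [n0, n1]."""
    n = sum(counts)
    return [c / n for c in counts]

def divergence_gain(t_before, c_before, t_leaves, c_leaves):
    """D_gain of equation (7): weighted divergence after the split (8)
    minus the divergence before the split."""
    n = sum(sum(t) for t in t_leaves)
    d_after = sum((sum(t) / n) * euclid(dist(t), dist(c))
                  for t, c in zip(t_leaves, c_leaves))
    return d_after - euclid(dist(t_before), dist(c_before))

# A split that perfectly separates treatment responders from control
# responders gives the maximal gain for this divergence.
print(divergence_gain([5, 5], [5, 5], [[5, 0], [0, 5]], [[0, 5], [5, 0]]))  # 2.0
```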

4.2.3 Class Variable Transformation

The third approach, like the one described in Section 4.2.2, models the uplift directly. Jaskowski et al. [7] propose the introduction of a Class Variable Transformation, i.e. let us

define Z ∈ {0, 1} such that

Z = 1 if Y = 1 and T,
Z = 1 if Y = 0 and C,   (9)
Z = 0 otherwise,

where T denotes the treatment group data and C denotes the control group data. (9) allows for the conversion of an arbitrary probabilistic classification model into a model which predicts uplift. In other words, if the customer has made a purchase, i.e. Y = 1, and belongs to the treatment group, Z is set to 1. This kind of person is then either a sure thing or a persuadable, see Table 2.1. If the customer on the other hand has not made a purchase, i.e. Y = 0, and belongs to the control group, Z is also set to 1. The customer is then either a lost cause or a persuadable. For all other cases Z is set to 0, which means that all do-not-disturbs belong to this group, i.e. there will be no risk of approaching the do-not-disturbs with a campaign. Note that this approach does not exclusively target the persuadables, as would be the optimal thing to do. The reason for this is simply that one individual can never belong to both the treatment group and the control group, thus only one outcome for that individual can be observed. Therefore, it is not possible to use the Class Variable Transformation to target the persuadables exclusively.

By assuming that T and C are independent of X = (X1, ..., Xp), and that P(C) = P(T) = 1/2, Jaskowski et al. show that

P^T(Y = 1 | X = x) − P^C(Y = 1 | X = x) = 2P(Z = 1 | X = x) − 1

which means that modeling the conditional uplift of Y is the same as modeling the conditional distribution of Z (see [7] for more details). It is thereby possible to use (9) to combine the treatment and control training data sets, apply any standard classification method to the new data set and thus get an uplift model for Y. Jaskowski et al. also show that the assumption P(C) = P(T) = 1/2 need not hold in practice. It is possible to rewrite the training data sets so that the assumption becomes valid, and such a transformation does not affect the conditional class distributions. Put differently, this approach can still be beneficial in cases where there are imbalanced control and treatment groups. In this project, the campaigns are actual postcards instead of the phone calls that are widely used in, for example, the insurance or telecommunication business. Many uplift modeling approaches rely on the fact that it is of great importance not to target the do-not-disturbs, and this group of individuals is most probably larger when approached using actual phone calls instead of advertisement that is sent out by, for example, email or a text message. Hence in this project, the group of do-not-disturbs can be argued to be not as large as it might have been if the offer instead were given using a physical phone call. Furthermore, recall from Section 3 that the share of observations belonging to the control group in each data set is relatively small compared to the share of observations in the treatment group. Considering these two facts, the transformation suggested in (9) will be slightly modified to fit this project. The first modification is to exclude all the negative samples in the control group, i.e. when Y = 0 and C. This is, as mentioned earlier, either a lost cause or a persuadable. Since

the control group is very small in relation to the treatment group, this modification is not expected to affect the persuadables in the uplift in a crucial manner. Furthermore, when introducing the modified Class Variable Transformation to the reduced data set, the focus will lie on only targeting the persuadables and sure things in the treatment group, namely in the following manner

Z = 1 if Y = 1 and T,
Z = 0 if Y = 0 and T,   (10)
Z = 0 if Y = 1 and C,

where once again, T and C denote the treatment and control group data, respectively. Put differently, the resulting classification model can be defined as

M^U = P^Z(X = x) = 2P(Z = 1 | X = x) − 1   (11)

where x is an observation that could be from both the treatment group and the control group, modified using the Z transformation (10). The probability P^Z can then be estimated with any classification method. As for the uplift modeling approach Subtraction of Two Models (Section 4.2.1), the Class Variable Transformation will be implemented using both Logistic Regression and Neural Networks. The aim is thus to compare the two uplift modeling approaches, as well as the two different statistical machine learning methods, to be able to conclude which uplift modeling approach and learning algorithm is best suited for this kind of problem.
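The modified transformation (10) can be sketched as follows, where each observation is a (y, group) pair and the negative control samples are dropped from the reduced data set:

```python
def transform(rows):
    """Apply the modified Class Variable Transformation (10).
    rows: (y, group) pairs with group 'T' (treatment) or 'C' (control).
    Returns the z labels of the reduced data set."""
    z = []
    for y, group in rows:
        if group == 'T':
            z.append(1 if y == 1 else 0)  # z = 1 only for treated purchasers
        elif y == 1:
            z.append(0)                   # positive control samples kept with z = 0
        # negative control samples (y = 0, 'C') are excluded entirely
    return z

print(transform([(1, 'T'), (0, 'T'), (1, 'C'), (0, 'C')]))  # [1, 0, 0]
```

A standard classifier trained on the reduced data then estimates P(Z = 1 | X = x), from which the uplift follows via (11).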

4.3 Classification and Prediction

When the pre-processing step is done, the data is ready for training classification models. Statistical machine learning methods are used to find relationships between the response and the variables in the form of a function, i.e. Y = f(X) + ϵ where ϵ is some error term. In the following subsections, the three statistical machine learning methods Logistic Regression, Random Forests and Neural Networks will be described theoretically. These are the classification methods used for uplift modeling in this project, where Logistic Regression is the simplest one as it is easy to implement and it is linear. Random Forests is a type of ensemble classifier, which is tested to see whether a more complex classifier yields a better result. Neural Networks have the ability to classify even more complex decision boundaries, hence this is the most complex method that will be tested. Logistic Regression and Neural Networks are thus used to make estimations of the probabilities from Sections 4.2.1 and 4.2.3, i.e. for Subtraction of Two Models:

M̂_U = P̂^T(Y = 1|X = x) − P̂^C(Y = 1|X = x)    (12)

and for the Class Variable Transformation:

M̂_U = P̂^Z(X = x) = 2P̂(Z = 1|X = x) − 1    (13)

Random Forests is also used to estimate (12), but with the splitting criteria described in Section 4.2.2. M̂_U is hence the estimated uplift model.
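For comparison, the Subtraction of Two Models estimate (12) can be sketched as below; the data, group sizes and response rates are hypothetical stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical treatment and control groups with a binary response.
rng = np.random.default_rng(1)
X_t = rng.normal(size=(500, 4))
y_t = (rng.random(500) < 0.4).astype(int)   # treatment group responses
X_c = rng.normal(size=(500, 4))
y_c = (rng.random(500) < 0.3).astype(int)   # control group responses

# Fit one model per group and subtract the predicted success probabilities (12).
model_t = LogisticRegression().fit(X_t, y_t)
model_c = LogisticRegression().fit(X_c, y_c)

X_new = rng.normal(size=(10, 4))
uplift = model_t.predict_proba(X_new)[:, 1] - model_c.predict_proba(X_new)[:, 1]
```

Each model is trained on one group only, which is exactly why this approach can drift towards modeling the class in each data set separately rather than the difference between the class probabilities.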

4.3.1 Logistic Regression

Logistic Regression is a so called generalized linear model and is one of the most widely used classifiers. According to [21], when having a binary response as in this project, one typically uses Logistic Regression to estimate the conditional probability P(Y = 1|X = x) = E[Y|X = x], where X = (X_1, ..., X_p). The linear logistic model models the log-odds:

log [ P(Y = 1|X = x) / P(Y = 0|X = x) ] = β_0 + β^T x    (14)

where β_0 ∈ R is the intercept term, β ∈ R^p is the vector of regression coefficients and x is one observation. After some manipulation of (14), the following expression for the conditional probability can be obtained:

P(Y = 1|X = x) = e^{β_0 + β^T x} / (1 + e^{β_0 + β^T x})    (15)

The model is fit by maximizing the binomial log-likelihood of the data, which is equivalent to minimizing the negative log-likelihood. Minimization of the negative log-likelihood along with the addition of an ℓ1-penalty (regularization) takes the form:

min_{β_0, β} { −(1/N) L(β_0, β; y, X) + λ||β||_1 }    (16)

where y is the response vector, X is the N × p data matrix of predictors and L is the log-likelihood. Put differently, given (14), the negative log-likelihood with ℓ1-penalty can be expressed in the following way:

−(1/N) Σ_{i=1}^{N} [ y_i log P(Y = 1|x_i) + (1 − y_i) log P(Y = 0|x_i) ] + λ||β||_1 =    (17)

−(1/N) Σ_{i=1}^{N} [ y_i(β_0 + β^T x_i) − log(1 + e^{β_0 + β^T x_i}) ] + λ||β||_1    (18)

where λ ≥ 0 is a complexity parameter that controls the impact of the shrinkage, i.e. the regularization. The objective thus becomes finding the estimates β̂_0 and β̂ that minimize (18).

The addition of the ℓ1-penalty is a regularization technique called the Lasso, where the ℓ1 norm of a coefficient vector β is defined as ||β||_1 = Σ_{j=1}^{p} |β_j|. Using this regularization technique shrinks the coefficient estimates towards zero, and forces some of them to become exactly zero when λ is large enough. The optimal value of λ can be obtained

with the use of Cross Validation (Section 4.3.4). The usage of the Lasso requires a standardization of the predictors so that they are all on the same scale, i.e. have mean 0 and standard deviation 1. Hence, by using the Lasso, variable selection is performed and a sparse model can be obtained [21].
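A minimal sketch of ℓ1-penalized Logistic Regression on standardized predictors, with the penalty strength chosen by cross-validation, is shown below using scikit-learn's LogisticRegressionCV (which parameterizes the penalty as C = 1/λ). The data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: the response depends on the first two predictors only.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
p = 1.0 / (1.0 + np.exp(-(X[:, 0] - X[:, 1])))
y = (rng.random(300) < p).astype(int)

# The Lasso requires the predictors to be on the same scale (mean 0, sd 1).
X_std = StandardScaler().fit_transform(X)

# l1-penalized Logistic Regression; the penalty strength (C = 1/lambda in
# scikit-learn's parameterization) is chosen by 5-fold cross-validation.
clf = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=10, cv=5)
clf.fit(X_std, y)
n_zero = int(np.sum(clf.coef_ == 0))  # coefficients shrunk exactly to zero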

4.3.2 Random Forests

Random Forests is a statistical machine learning method that can perform both regression and classification. The following theory is taken from [5]. This method builds many decision trees that are averaged to obtain the final prediction. The technique of averaging a statistical machine learning model is called bagging; it improves stability and avoids overfitting. Normally, decision trees are not that competitive with the best supervised learning approaches in terms of prediction accuracy, since they tend to have low bias but high variance: building two decision trees on slightly different data can yield two very different trees. Bagging is therefore well suited for decision trees since it reduces the variance. The idea behind Random Forests is to draw B bootstrap samples from the training data set and then build a decision tree on each of the B training samples. The reason why the method is called Random Forests is that it chooses a random subset of input variables before every split when building each tree. By doing this, the trees become less correlated with each other, which in turn lowers the variance of the average even further. The algorithm for Random Forests for both regression and classification can be seen in Algorithm 2 (taken from [5]).

Algorithm 2: Random Forest

1. For b = 1 to B:
   (a) Draw a bootstrap sample Z* of size N from the training data with replacement.
   (b) Grow a random-forest tree T_b on the bootstrapped data Z*, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached:
       i. Select m variables at random from the p variables.
       ii. Pick the best variable/split-point among the m.
       iii. Split the node into two daughter nodes.
2. Output the ensemble of trees {T_b}_{b=1}^{B}.
3. To make a prediction at a new point x:
   Regression: f̂_rf^B(x) = (1/B) Σ_{b=1}^{B} T_b(x).
   Classification: let Ĉ_b(x) be the class prediction of the bth random-forest tree; then Ĉ_rf^B(x) = majority vote {Ĉ_b(x)}_{b=1}^{B}.
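Algorithm 2 is what off-the-shelf implementations provide; a sketch with scikit-learn on synthetic data, using B = 100 trees and m ≈ √p split candidates, could look as follows:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data with p = 9 predictors.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 9))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)

# B = 100 bootstrapped trees; max_features="sqrt" draws m ~ sqrt(p) candidate
# variables at each split, and predictions are the majority vote over trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X, y)
pred = forest.predict(X[:5])
```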

When a split is made, a random sample of m variables is chosen as split candidates from the p predictors. Typically, m is set to approximately √p in classification settings, or p/3 in regression settings. The reason for this is that a small value of m reduces the variance when there are many correlated variables. In each split, only one of the m variables is used, and since m is small, not even a majority of the available variables is considered. Guelman et al. [4] propose an algorithm for Random Forests for uplift modeling where data from both the treatment and control groups are included in the training data. The uplift predictions of the individual trees are averaged, and thus an uplift estimate is obtained. There are two tuning parameters, namely the number of trees in the forest and the number of candidate predictors in the random subset at each node. The proposed algorithm, which is a modification of Algorithm 2, is presented in Algorithm 3. To grow a tree, recursive binary splitting is used, and thus a criterion is needed for making these binary splits. There are several different splitting criteria that can be used for this purpose. In the case of decision trees in an uplift modeling setting, Rzepakowski et al. [16] propose that the splitting criterion should be based on conditional distribution divergences, recall Section 4.2.2.

Algorithm 3: Random Forest for Uplift

1. For b = 1 to B:
   (a) Draw a bootstrap sample Z* of size N from the training data with replacement.
   (b) Grow an uplift decision tree UT_b on the bootstrapped data Z*, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached:
       i. Select m variables at random from the p variables.
       ii. Pick the best variable/split-point among the m. The split criterion should be based on a conditional divergence measure.
       iii. Split the node into two daughter nodes.
2. Output the ensemble of uplift trees {UT_b}_{b=1}^{B}. The predicted uplift for a new data point x is obtained by averaging the uplift predictions of the individual trees in the ensemble: f̂_uplift^B(x) = (1/B) Σ_{b=1}^{B} UT_b(x).

4.3.3 Neural Networks

Artificial Neural Networks (ANN) were first developed to mimic the human brain and are widely used for artificial intelligence tasks such as face recognition, speech recognition etc. [1] defines an ANN as a network of units that receive inputs, representing the networks of neurons of a human brain. Neural Networks are simply nonlinear statistical models and work well for both regression and classification problems according to [5]. They are usually used when the data is high-dimensional with a large sample size and the modeling needs to be of high complexity. This is therefore the most complex model implemented in this project. The Neural Network method of choice is the Multilayer Perceptron (MLP). The architecture consists of the input nodes, hidden layers of nodes and a layer of output nodes. An example of a two-layer MLP can be seen in Figure 4.2. The number of hidden layers and nodes determines the complexity of the model. The MLP performs a nonlinear mapping of the input to the output, with hidden layers in between, using activation functions. The activation functions can be linear functions but are usually chosen to be the sigmoid function to obtain the nonlinear modeling. The following is the mathematical description of a two-layer MLP, i.e. an MLP with one hidden layer, originally demonstrated in [1].

Figure 4.2: The architecture of the Multilayer Perceptron with three input nodes, one hidden layer with two nodes and an output layer with two nodes. The functions f and g are the activation functions.

Assume that the input layer consists of p nodes (X_m, m = 1, 2, ..., p), the hidden layer of t nodes (Z_j, j = 1, 2, ..., t) and the output layer of s nodes (Y_k, k = 1, 2, ..., s). The weights of the connections between the input and the hidden layer are β_mj, and the weights of the connections between the hidden layer and the output layer are α_jk. The weights also have the bias terms β_0j and α_0k, respectively. Furthermore, suppose that the input and hidden nodes form the vectors X = (X_1, ..., X_p)^T and Z = (Z_1, ..., Z_t)^T. Moreover, the weights form the vectors β_j = (β_1j, ..., β_pj)^T and α_k = (α_1k, ..., α_tk)^T. Then let U_j = β_0j + X^T β_j and V_k = α_0k + Z^T α_k [1]. Using this, the activation functions f_j(·) and g_k(·) can be introduced as follows:

Z_j = f_j(U_j), j = 1, 2, ..., t

μ_k(X) = g_k(V_k) = g_k( α_0k + Σ_{j=1}^{t} α_jk f_j( β_0j + Σ_{m=1}^{p} β_mj X_m ) ), k = 1, 2, ..., s

where the activation function f_j(·) belongs to the hidden layer and g_k(·) to the output layer [5]. The activation functions are usually chosen to be the sigmoid function σ(v) = 1/(1 + e^{−v}) (Figure 4.3), which works very well for classification problems.

The kth generated output node is then Ỹ_k = μ_k(X), while the true output is the same value plus an error term ε_k:

Ỹ_k = μ_k(X)

Y_k = μ_k(X) + ε_k

Figure 4.3: An illustration of the sigmoid function that can be used as the activation function in the Multilayer Perceptron. The red curve is the sigmoid function σ(v) and the dashed curves are the σ(sv) functions, where s is a scale parameter that controls the activation rate. When s = 1/2, the curve looks like the blue one, and when s = 10, it looks like the purple one. The figure is taken from [5].

The MLP performs supervised learning, using the backpropagation learning rule to update the weights. Backpropagation is an iterative gradient-descent method which updates the weights in the direction that minimizes the error sum of squares. The mathematical background of backpropagation learning for a two-layer MLP is presented below; the theory can be found in [1]. The following is the error sum of squares at the kth output node, where K is the set of output nodes and i is the index of each observation:

E_i = (1/2) Σ_{k∈K} (Y_{i,k} − Ỹ_{i,k})² = (1/2) Σ_{k∈K} e²_{i,k}, i = 1, 2, ..., n

The new term, e_{i,k} = Y_{i,k} − Ỹ_{i,k}, is the error signal at the kth output node. For binary classification problems, the output consists of only one node, i.e. Y_{i,k} = Y_i and Ỹ_{i,k} = Ỹ_i. The error sum of squares for the whole data set is then the average of all E_i:

ESS = (1/n) Σ_{i=1}^{n} E_i = (1/2n) Σ_{i=1}^{n} Σ_{k∈K} e²_{i,k}

The algorithm then updates the weights in the direction in which the error is minimized.

Concerning the weights αi,jk between the hidden layer and the output layer, the updating formula is as follows:

αi+1,jk = αi,jk + ∆αi,jk

Δα_{i,jk} = −η ∂E_i/∂α_{i,jk}

where η is the learning rate, which determines the size of each learning step. If η is too large, the algorithm might miss a local minimum of the error. On the other hand, if η is too small, the computing time becomes very large. The following is the derivative of the error sum of squares using the chain rule, assuming that the activation function g_k(·) is differentiable:

∂E_i/∂α_{i,jk} = (∂E_i/∂e_{i,k}) · (∂e_{i,k}/∂Ỹ_{i,k}) · (∂Ỹ_{i,k}/∂V_{i,k}) · (∂V_{i,k}/∂α_{i,jk})
= e_{i,k} · (−1) · μ'_k(X_i) · Z_{i,j}
= −e_{i,k} · g'_k(V_{i,k}) · Z_{i,j}
= −e_{i,k} · g'_k(α_{i,0k} + Z_i^T α_{i,k}) · Z_{i,j}

Similar holds for the update formula for the weights βi,mj between the input nodes and the hidden layer, namely:

βi+1,mj = βi,mj + ∆βi,mj

Δβ_{i,mj} = −η ∂E_i/∂β_{i,mj}

The derivative can again be obtained using the chain rule

∂E_i/∂β_{i,mj} = (∂E_i/∂Z_{i,j}) · (∂Z_{i,j}/∂U_{i,j}) · (∂U_{i,j}/∂β_{i,mj})

where the three factors are given by equations (19), (20) and (21). Note that it is again assumed that the activation functions f_j(·) and g_k(·) are differentiable.

∂E_i/∂Z_{i,j} = Σ_{k∈K} e_{i,k} · ∂e_{i,k}/∂Z_{i,j} = Σ_{k∈K} e_{i,k} · (∂e_{i,k}/∂V_{i,k}) · (∂V_{i,k}/∂Z_{i,j}) = −Σ_{k∈K} e_{i,k} · g'_k(V_{i,k}) · α_{i,jk}    (19)

∂Z_{i,j}/∂U_{i,j} = f'_j(U_{i,j}) = f'_j(β_{i,0j} + X_i^T β_{i,j})    (20)

∂U_{i,j}/∂β_{i,mj} = X_{i,m}    (21)

Putting this together, the gradient-descent updating rules for the weights αi,jk and βi,mj becomes:

α_{i+1,jk} = α_{i,jk} − η ∂E_i/∂α_{i,jk} = α_{i,jk} + η e_{i,k} g'_k(V_{i,k}) Z_{i,j}    (22)

β_{i+1,mj} = β_{i,mj} − η ∂E_i/∂β_{i,mj} = β_{i,mj} + η Σ_{k∈K} e_{i,k} g'_k(V_{i,k}) α_{i,jk} f'_j(U_{i,j}) X_{i,m}    (23)

The sensitivities δ_{i,k} and δ_{i,j} of the ith observation, also called local gradients, are now introduced, where k refers to the kth node of the output layer and j to the jth node of the hidden layer.

δ_{i,k} = e_{i,k} g'_k(V_{i,k})    (24)

δ_{i,j} = f'_j(U_{i,j}) Σ_{k∈K} δ_{i,k} α_{i,jk}    (25)

Using (24) and (25) in equations (22) and (23) yields the following updating formulas for the weights:

α_{i+1,jk} = α_{i,jk} + η δ_{i,k} Z_{i,j}    (26)

β_{i+1,mj} = β_{i,mj} + η δ_{i,j} X_{i,m}    (27)

The weights are initialized with randomly generated, uniformly distributed numbers close to zero. The goal is to make the algorithm converge to a global minimum. If the algorithm does not converge, it might be stuck at a local minimum. To solve this problem, training can be performed again with new random weights, although it is not always possible for the algorithm to converge. Next, training is done for some number of epochs. One epoch is completed when training has been done once on the whole training set. Training can be carried out in two different ways, namely online training or batch learning. Online training is when the weights get updated for each observation, one at a time. Hence, the updating formulas (26) and (27) describe online training with observations i. When the updates have been done for all observations, one epoch is completed. Online training is usually preferable to batch learning, since learning is faster when data observations are similar, and it is better at avoiding local minima during training. Batch learning is when the weights get updated simultaneously for the whole training set, i.e. once per epoch. The updating formulas for the weights then include a summation of the derivatives over the whole training set, where n is the number of observations and i indexes each epoch.

α_{i+1,jk} = α_{i,jk} + η Σ_{h=1}^{n} δ_{h,k} Z_{h,j}    (28)

β_{i+1,mj} = β_{i,mj} + η Σ_{h=1}^{n} δ_{h,j} X_{h,m}    (29)
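The online updating rules (26) and (27), with sigmoid activations so that f'(u) = f(u)(1 − f(u)), can be sketched in NumPy as below. The toy data, network sizes and learning rate are hypothetical, and no convergence is guaranteed:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_online_epoch(X, Y, beta, beta0, alpha, alpha0, eta=0.5):
    """One epoch of online backpropagation for a two-layer MLP with sigmoid
    activations, following the updating formulas (26) and (27)."""
    for x, y in zip(X, Y):
        U = beta0 + x @ beta          # hidden pre-activations U_j
        Z = sigmoid(U)                # hidden nodes Z_j = f_j(U_j)
        V = alpha0 + Z @ alpha        # output pre-activations V_k
        Y_hat = sigmoid(V)            # network outputs
        e = y - Y_hat                 # error signals e_k
        # local gradients (24) and (25); for the sigmoid, g'(v) = g(v)(1 - g(v))
        delta_k = e * Y_hat * (1.0 - Y_hat)
        delta_j = Z * (1.0 - Z) * (alpha @ delta_k)
        # weight updates (26) and (27), with the corresponding bias updates
        alpha += eta * np.outer(Z, delta_k)
        alpha0 += eta * delta_k
        beta += eta * np.outer(x, delta_j)
        beta0 += eta * delta_j

# Toy usage: p = 2 inputs, t = 4 hidden nodes, s = 1 output, XOR-like data.
rng = np.random.default_rng(4)
X = rng.integers(0, 2, size=(200, 2)).astype(float)
Y = (X[:, 0] != X[:, 1]).astype(float).reshape(-1, 1)
beta = rng.normal(scale=0.1, size=(2, 4)); beta0 = np.zeros(4)
alpha = rng.normal(scale=0.1, size=(4, 1)); alpha0 = np.zeros(1)
for _ in range(500):
    train_online_epoch(X, Y, beta, beta0, alpha, alpha0)
```

Each pass over the data set constitutes one epoch, and the weights are updated one observation at a time, matching the online training scheme described above.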

The training of the network then runs for a number of epochs so that the learning converges towards the global minimum. It is important not to let it run for too many epochs, since the model might then become overfitted. An overfitted model performs very well on training data, but very poorly on unseen data (test data). Since the model is used to model the uplift, the risk of overfitting is even larger (Section 4.1.2). In order for the network to have good generalization power, overfitting needs to be avoided. One method for handling overfitting is to construct a network that is not too complex. In this project, trial and error is used to find a suitable number of layers and nodes. Finding the optimal number of layers and nodes is also important in order for the network to find an underlying function. Furthermore, weight decay, a regularization technique analogous to ridge regression, is also used in this project to avoid overfitting. This technique shrinks some weights towards zero so that the complexity of the network is adjusted [5]. The input dimension is also reduced using PCA (Section 4.1.2) to avoid overfitting. The input data used for the MLP can come in many different scales; in order for PCA to work, the data is standardized. After standardization, the variables have mean 0 and standard deviation 1, see [1] for more details.

4.3.4 Cross Validation

Cross Validation (CV) is a commonly used resampling method, performed by repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the model. In this project, CV is used as a part of the modeling process to obtain the optimal value of different tuning parameters or the optimal number of variables to use in the models. When using CV, the data is split into K equal parts, where one part is used as the test set while the remaining parts are used as the training set. The prediction error of the fitted model, f̂^{−k}(x), is then calculated using the kth part as hold-out set (test set). All K parts of the data are used as hold-out sets one at a time, and the K resulting estimates of the prediction error are combined into a CV estimate of the prediction error. According to [5], by letting κ : {1, ..., M} → {1, ..., K} be an indexing function, the CV estimate can be calculated using:

CV(f̂) = (1/M) Σ_{i=1}^{M} L(y_i, f̂^{−κ(i)}(x_i))    (30)

where L is the prediction error (loss) associated with each fitted model. When K = N, the method is known as Leave One Out Cross Validation. This means the learning procedure is fit N times, i.e. the same number of times as there are observations in the data set. Normally, K-fold Cross Validation is used with K set to 5 or 10, due to the lower computational cost compared to a larger number of folds and because it yields a good bias-variance trade-off. When the objective is to find the optimal value of a tuning parameter γ, given a set of models f̂^{−k}(x, γ), formula (30) can be modified to:

CV(f̂, γ) = (1/M) Σ_{i=1}^{M} L(y_i, f̂^{−κ(i)}(x_i, γ))    (31)

The CV estimate in (31) yields a test error curve, and the objective hence becomes finding the value γ̂ that minimizes the CV function.
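The test error curve of (31) can be traced by looping over candidate values of the tuning parameter. The sketch below uses a ridge-penalized regression on synthetic data purely as a stand-in for a model with a tuning parameter γ:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic regression data.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# For each candidate gamma, average the held-out squared-error loss over
# K = 5 folds, cf. (31), and pick the gamma minimizing the CV curve.
gammas = np.logspace(-3, 3, 13)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
cv_error = []
for g in gammas:
    fold_errors = []
    for train_idx, test_idx in kf.split(X):
        model = Ridge(alpha=g).fit(X[train_idx], y[train_idx])
        fold_errors.append(np.mean((y[test_idx] - model.predict(X[test_idx])) ** 2))
    cv_error.append(np.mean(fold_errors))
best_gamma = gammas[int(np.argmin(cv_error))]
```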

4.4 Evaluation

There are several ways to evaluate classification models, and one of the most common illustrative tools used for this purpose is the so called Receiver Operating Characteristic (ROC) curve. The ROC curve is used to evaluate the performance of a classification model, i.e. how well the model classifies positive individuals as positives and negative individuals as negatives. Usually, in classification models, the prediction can be compared to the true answer (test target data) for each individual. An uplift model cannot have a true answer of the uplift for each individual, since the same individual can never belong to both the treatment and the control group at the same time. To solve this, the evaluation metric that is commonly used for uplift models is the Qini curve, which evaluates performance on a population level.

4.4.1 ROC Curve

The result of a classification model has four different outcomes, which can be explained with a confusion matrix (Figure 4.4, from [2]). A true positive (TP) is when the model classifies an instance as positive and the instance is indeed positive, i.e. the model correctly classifies a positive. A true negative (TN) is when the model correctly classifies a negative. A false positive (FP) is when the model classifies an instance as positive but the instance is negative, i.e. the model classifies to the wrong class. Similarly, a false negative (FN) is when the model classifies an instance as negative but the instance is positive. These four outcomes can be used to calculate different performance metrics that are used to plot the ROC curve.

Figure 4.4: A confusion matrix shows the result of a classification model.

The true positive rate (TP rate), also called sensitivity or recall, is on the y-axis of the ROC curve and is defined as:

TP rate = TP / P = (number of positives correctly classified) / (total number of positives)

The false positive rate (FP rate) is on the x-axis of the ROC curve and is defined as:

FP rate = FP / N = (number of negatives incorrectly classified) / (total number of negatives)

Hence, the ROC curve plots the probability that a positive instance is classified as positive against the probability that a negative instance is classified as positive. Other important performance metrics obtained from confusion matrices are the following:

accuracy = (TP + TN) / (P + N),    precision = TP / (TP + FP)

For discrete classifiers such as decision trees, the result is a point in the ROC space. This is because such a classifier produces a single class for each individual instead of a probability or score, which thus yields a single confusion matrix as a result. A confusion matrix gives one value for the FP rate and one value for the TP rate, and hence a point in the ROC space.

Probabilistic classifiers, such as Neural Networks, produce a probability that an instance belongs to a certain class. In such cases, a threshold is needed to get the final class prediction. Many models use 0.5 as the default threshold, but it is not always the case that this value yields the best result. One threshold value gives one point in the ROC space, as for the discrete classifiers. To obtain the ROC curve for probabilistic classifiers, one can produce the result of a classifier using many different threshold values. As a consequence, this results in many different points, i.e. a curve, in the ROC space. The accuracy can be calculated for each threshold, and the threshold with the highest accuracy is thus the best threshold.
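Sweeping thresholds to trace the ROC curve can be sketched as follows; the scores come from a logistic model on synthetic data, and the scikit-learn routines (of the kind used via built-ins in this project) are shown alongside for the curve and the AUC:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic data and scores from a probabilistic classifier.
rng = np.random.default_rng(6)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 0).astype(int)
scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

# Sweep thresholds manually: each threshold yields one (FP rate, TP rate) point.
thresholds = np.linspace(0.0, 1.0, 101)
fp_rate = [np.mean(scores[y == 0] >= t) for t in thresholds]
tp_rate = [np.mean(scores[y == 1] >= t) for t in thresholds]

# The built-in routines produce the same curve (and the AUC) more efficiently.
fpr, tpr, thr = roc_curve(y, scores)
auc = roc_auc_score(y, scores)
```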

An algorithm that is more efficient for big data sets instead uses the monotonicity of threshold classification: any instance classified as positive at a certain threshold will also be classified as positive at any lower threshold. However, this algorithm is not implemented from scratch in this project; built-in functions for this algorithm in R and Python are used instead.

Figure 4.5: Example of two ROC curves of two different models.

The diagonal line y = x in the ROC space represents a random classifier. A classifier that performs better than a random classifier will get a result in the upper triangle above the diagonal. The best model is therefore the one that has the largest distance above the diagonal line. Another way of comparing different models using the ROC curve is to calculate the area under the curve (AUC). The ROC space is the unit square, and the AUC value consequently lies in [0, 1]. For a classification model to perform better than a random classifier, the AUC value needs to be greater than 0.5. The largest AUC value among different models represents the model with the best average performance. An important note, though, is that the model with the largest AUC is not always better than other models with lower AUC, see [2] for more details.

4.4.2 Qini Curve

The Qini curve is a good tool for comparing different uplift models. It is constructed using a gains chart: the predictions are sorted from best to worst score, and the result is divided into segments where different numbers of individuals are treated. The vertical axis of the Qini curve is the uplift, i.e. the cumulative number of incremental sales achieved, and the horizontal axis is the number of individuals treated, i.e. the different segments. The estimated number of incremental sales achieved per segment is calculated using:

u = R_t − R_c N_t / N_c    (32)

For each segment, R_t and R_c are the numbers of individuals predicted to make a purchase in the treatment group and the control group, respectively. Further, N_t and N_c are the total numbers of individuals in the treatment group and control group, respectively. An example of Qini curves is shown in Figure 4.6, as demonstrated in [14]. The purple curve has its maximum before all individuals are treated. This means that some individuals are influenced negatively by the treatment, and one should thereby choose a smaller treatment group for the best uplift.

Figure 4.6: An example of Qini curves where the red curve is the optimal uplift model. The blue line is a random classifier, as used in ROC curve. The purple and the green curves are two different uplift models and where the purple one is the best one in this specific case.

For each model, the Qini curve is used to decide the optimal cutoff that gives the best profit. For example, the purple curve and its corresponding model in Figure 4.6 reach the maximum profit at around 60% of individuals treated. At that point, there is a certain cutoff which can be used to decide whether or not a customer will make a purchase because of the campaign offer. Furthermore, when the model is used for a new group of customers, that same cutoff is used to classify the persuadables, i.e. whom to target in a campaign, see [15] for more details. Another way of comparing models is to compute the Qini value. The Qini value is defined as the area between the actual incremental gains curve from the fitted model and the area under the diagonal corresponding to a random model. A negative Qini value indicates that the result of an action is worse than doing nothing, while a positive value indicates the opposite, see [3].
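A sketch of computing Qini curve points from predictions on synthetic data follows. Here (32) is applied cumulatively per segment with the treatment/control counts taken within each targeted segment, which is one common reading of the formula and an assumption rather than necessarily the exact implementation used in this project:

```python
import numpy as np

def qini_points(scores, y, treated, n_segments=10):
    """Cumulative incremental sales per segment, following (32):
    u = R_t - R_c * N_t / N_c, with counts taken within each targeted segment."""
    order = np.argsort(-scores)          # sort individuals from best to worst score
    y_s, t_s = y[order], treated[order]
    n = len(y_s)
    points = [(0.0, 0.0)]
    for s in range(1, n_segments + 1):
        top = int(round(n * s / n_segments))
        in_t = t_s[:top] == 1
        N_t, N_c = in_t.sum(), (~in_t).sum()
        R_t, R_c = y_s[:top][in_t].sum(), y_s[:top][~in_t].sum()
        u = R_t - R_c * N_t / N_c if N_c > 0 else float(R_t)
        points.append((top / n, u))
    return points

# Hypothetical predictions and outcomes for 1000 individuals.
rng = np.random.default_rng(7)
scores = rng.random(1000)
treated = rng.integers(0, 2, size=1000)
y = (rng.random(1000) < 0.2 + 0.1 * treated * scores).astype(int)
curve = qini_points(scores, y, treated)
```

Plotting these points (fraction treated on the horizontal axis, u on the vertical axis) yields a Qini curve of the kind shown in Figure 4.6.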

4.5 Programming Environment of Choice

The natural choice of software used for generation of data and handling of missing values is the software suite SAS, which is developed by SAS Institute. More specifically, SAS Studio is used. It is thereby possible to use procedures for writing SQL code, and thus to modify tables directly in the SAS software environment. For detection and deletion of outliers, the

software environment R is used [13]. In R, it is possible both to implement statistical machine learning methods and to perform matrix operations. Due to these facts, R is best suited for the outlier detection and deletion: the influential statistics can be obtained from the implementation of a statistical machine learning method, and the extrapolation observations can be identified using some matrix operations. There is no standard way of implementing Random Forests or Neural Networks in SAS Studio. As there exists a package in R called uplift [3], which includes ready-made implementations of different uplift modeling approaches, it will be used for Modeling Uplift Directly using Random Forests. The uplift package includes the algorithm for modeling uplift with Random Forests proposed by Guelman et al. [4]. Python 3.6 is used for implementing Neural Networks, as there exist suitable packages for machine learning in Python. The package used for this purpose is scikit-learn [12], and the method for Neural Networks is MLPClassifier(). Logistic Regression is implemented in R using glm for the Class Variable Transformation, and in Python using LogisticRegressionCV() for the Subtraction of Two Models.

5 Experiments and Results

In this section, the practical implementation of the project is presented and described along with the results of the different statistical machine learning methods which are presented using figures and tables.

5.1 Data Pre-Processing

Cleaning the raw data is a crucial step in order to obtain good quality data representations. The pre-processing of the data in this project is about removing and manipulating bad data values so that the data is a good representation of the desired objects. Put differently, in the following subsection, the handling of missing values, outliers and the binning of linearly dependent variables is presented.

5.1.1 Data Cleaning

The first step of the data cleaning process is to identify and handle missing values. Depending on the number of missing observations, as well as whether they are MAR, MCAR or NMAR (recall the definitions in Section 4.1.1), the technique for handling the missing values might differ. Thus, the descriptive statistics for each data set are determined to get an overall view of the variables that have missing values. The descriptive statistics related to Campaign 1 can be found in Table 5.1. The other campaigns follow the same pattern, meaning that the variables that have missing values are the same for each data set, with the number of missing observations being greatest for gender and smallest for share_red_or_dis_order_3 and share_ret_order_3.

Variable                    NMiss    Mean        Min         Median      Max
gender                      16 785   0.94        0.00        1.00        1.00
age                         392      35.41       15.00       34.00       89.00
Clubmember                  375      0.55        0.00        1.00        1.00
IsStaff                     375      0.00        0.00        0.00        1.00
lastPurchaseDate            183      2017-01-27  2015-09-25  2017-04-18  2017-09-25
share_red_or_dis_order_3    14       0.60        0.00        0.89        1.00
share_ret_order_3           14       0.50        0.00        0.50        1.00

Table 5.1: Descriptive statistics for Campaign 1. Only the variables with missing data are presented. The total number of customers in this data set is 181 221 and the campaign start date is 2017-10-30. NMiss is the number of missing observations for each corresponding variable.

The data sets related to the first five campaigns contain variables that have a relatively high number of missing values in relation to N, the total number of customers in each data set. Hence, a Multiple Imputation method is well suited to handle these missing values. For the sake of simplicity, all data sets are treated the same way. The variables gender and age contain information that is voluntary for the customers to enter, which might be the reason for the missing data. A quick glance at Clubmember and IsStaff shows that the number of missing values is the same for these two variables in every data set. When investigating this further, it is possible to conclude that it is the same observations that have these variables missing. Furthermore, the variable age is missing for the same observations as for these variables. The conclusion is that some information related to these observations is missing in some tables due to some systematic error in the building of the data set. It can also be due to missing information in some tables related to certain districts in the market of choice. Put differently, this data is most probably MCAR. The idea is, as just mentioned, to perform Multiple Imputation for these variables if valid. To begin with, the MI procedure is used along with a statement that tells the procedure not to perform any imputations, but to print the pattern of the missing data. Using the output, it can be concluded that the missing data pattern is arbitrary, and thus the use of the FCS statement is valid. Moreover, the data is concluded to be under the MCAR assumption, and hence Multiple Imputation using the MI procedure is performed. m is set to 25, since this is the number of imputations recommended by default in the procedure. Next, the imputed data sets are analyzed using the MIXED procedure and lastly, the m = 25 individual analyses are combined using the MIANALYZE procedure, resulting in one pooled estimate for each parameter. The resulting parameter estimates can be seen in Table 5.2.

Campaign    gender    age    Clubmember    IsStaff
1           1         35     1             0
2           1         37     0             0
3           1         35     0             0
4           1         36     1             0
5           1         35     1             0
6           1         36     1             0

Table 5.2: Summary of the resulting parameter estimates derived using the Multiple Imputation method. Recall that the variable lastPurchaseDate is deleted for the observations that have a missing value. Moreover, share_red_or_dis_order_3 and share_ret_order_3 are set to zero when the data is missing.

The missing values for the variable lastPurchaseDate can most likely be explained by the reasoning that these customers have not made any purchase at all during the last 24 months. Recall from Section 3 that the data in this project only reflects customers in the frequent stage. Thus, the customers related to the missing values for this variable are most likely in the lost stage, but have become a part of the campaigns by mistake. This means that these values are MCAR. Considering this, these customers should be removed from the data sets, as they are not suitable candidates for the campaigns and thus risk contributing inappropriate influence to the models. In other words, the Listwise Deletion method is used for this variable. Further, to be able to use the information from the variable lastPurchaseDate, another variable named lastPurchase is added. This new variable is the difference in days between lastPurchaseDate and AdressFileDate⁵. Lastly, the variables share_red_or_dis_order_3 and share_ret_order_3 have the exact same

5. Recall that addressFileDate is the date the customers were chosen to be a part of the campaign.

amount of missing data in each data set, respectively. The simple explanation is that none of the customers related to the missing data for these variables have made any purchase within the 3 months before the address file date. Hence the denominator in the expression for calculating the share, which is the total number of purchases, is zero. The result thus becomes NULL, i.e. the value is set to missing. The intuitive way to handle the missing values for these variables is therefore to replace them with zero in every data set: the values actually are zero, but have been set to NULL because of the zero in the denominator of the expression used for calculating the share. Moreover, concerning share_red_or_dis_order_3 and share_ret_order_3, it is advantageous to bin them in order to be able to model non-linear effects. These variables are calculated from other variables used in the data sets, and are thus linearly dependent on those variables. Therefore, these variables as well as share_red_or_dis_order_12, share_red_or_dis_order_24, share_ret_order_12 and share_ret_order_24 are binned using the Bucket Binning method. Recall from Section 4.1.1 that bucket binning means that evenly spaced cut points are used in the binning process. In this case the number of bins is set to 5 for each variable, and thus the new values for these variables are discrete and range from 1 to 5.
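Bucket Binning as just described, i.e. evenly spaced cut points producing 5 discrete levels, can be sketched with pandas on a made-up share variable (the thesis performs the binning in its own pipeline; this is only an illustration of the technique):

```python
# Bucket binning: 5 evenly spaced bins over the variable's range, labelled 1..5.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
share = pd.Series(rng.random(1000))                   # toy share values in [0, 1]
share_binned = pd.cut(share, bins=5, labels=[1, 2, 3, 4, 5])
print(share_binned.value_counts().sort_index())       # observations per bucket
```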

Campaign   N         N_RF      N_sub     N_Z
1          181 221   181 038   181 035   176 867
2           82 828    82 746    82 743    76 101
3          155 096   153 345   153 343   140 178
4           62 121    62 098    62 097    57 437
5          310 607   309 300   309 294   284 276
6          207 071   207 066   189 061   191 441

Table 5.3: The resulting number of observations in each table after missing values and outliers have been handled. N is the number of observations in each data set before any pre-processing has been made. N_RF is the number of observations remaining after missing values have been handled; this is also the number of observations in the data sets used in the Modeling Uplift Directly approach. N_sub and N_Z are the numbers of observations left after missing values have been handled and outliers have been removed, for the data sets used in Subtraction of Two Models and the Class Variable Transformation approach, respectively. Note that N_Z is smaller than N_sub since the Class Variable Transformation has been applied to the data sets related to N_Z, meaning that all the negative samples in the control group have been deleted according to (10).

Once the missing values have been handled and the binning has been made, it is time for the detection and handling of outliers. There is a risk that the imputation methods used for handling missing values might have replaced some of the missing values with values that count as outliers. This is not the case in this project, since the imputed values are approximately the same as the mean value of the corresponding variable. However, even if it had been the case, it would not have been a problem, since such hypothetical extreme values would have been taken care of in the outlier detection. As mentioned in Section 4.2, uplift modeling can be viewed as an instance of regression as well as an instance of classification. Thus, recall from Section 4.1.1 that the method used for detecting and removing outliers in this project is Hidden Extrapolation. Further,

recall from Section 4.3.2 that Random Forests is immune to the effect of outliers. Hence, Hidden Extrapolation is implemented only for the data sets that are used in Logistic Regression and Neural Networks. Put differently, the data sets used in Random Forests are only pre-processed such that missing values are handled. Since Logistic Regression and Neural Networks are implemented using the same uplift modeling approaches in this project, namely Subtraction of Two Models and the Class Variable Transformation, Hidden Extrapolation is applied to each data set used in these two approaches. To perform Hidden Extrapolation, each data set is fit using the glm method in R, which fits generalized linear models. The influence statistics are then collected so that the hat matrix H and hmax can be obtained for each data set, recall (2) and (3). Once hmax is obtained for a data set, the extrapolating observations can be identified and removed. The reduced data sets resulting from the handling of missing values and outliers can be seen in Table 5.3. In this project the variable selection is handled in different ways for the different statistical machine learning methods, and it is presented for each method in Section 5.2.
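The Hidden Extrapolation check can be sketched as follows: compute the leverages (the diagonal of the hat matrix H = X(X'X)^{-1}X'), take hmax over the training design matrix, and flag observations whose leverage exceeds hmax, cf. (2) and (3). The design matrix below is synthetic.

```python
# Leverage-based extrapolation check on a toy design matrix.
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # design matrix with intercept
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)     # diagonal of the hat matrix H
h_max = h.max()                                 # largest leverage in the training data

x_new = np.array([1.0, 8.0, -8.0])              # a clearly extrapolating point
h_new = x_new @ XtX_inv @ x_new                 # its leverage w.r.t. the fitted model
print(h_new > h_max)                            # True: the point would be removed
```

A useful sanity check is that the leverages sum to the number of columns of X, since the trace of H equals the rank of the design matrix.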

5.2 Uplift Modeling and Classification

In the following subsections, the different variable selection methods are presented, along with how the different statistical machine learning methods are built and examined. The best performing resulting models are also presented.

5.2.1 Random Forests

Uplift for Random Forests is implemented in R using the uplift package, which has a ready-made implementation of Random Forests in an uplift modeling setting. However, before the implementation is done, all the variables are converted to the right data types, i.e. categorical variables are converted to factors, integers to integers and decimals to numeric. As mentioned in Section 4.1.2, even though decision-tree learners perform variable selection as part of the modeling process, variable selection is an important part of the process for decision-tree learners used in an uplift modeling setting. This is because such a model estimates the difference between the outcomes of two models and thus easily overfits the data. This means that variable selection is a critical step in the modeling process for uplift Random Forests, and two different methods are tested to obtain the best model for each data set. The first method applied to each data set is to use Variable Importance (VI) to rank the variables according to their importance, along with Cross Validation for Random Forests to obtain the optimal number of variables to use in the model. The optimal number of variables, i.e. the number of variables that yields the lowest error rate, is presented in Table 5.4. It is worth noting that the Cross Validation function for Random Forests in R does not test every number of variables, due to the high computational cost. Hence, by looking at the two lowest errors obtained using Cross Validation, it is possible to conclude the approximate number of

Campaign   Variables   Lowest error      Variables   Second lowest error
1          4           1.547 · 10^-8     7           4.640 · 10^-8
2          4           5.693 · 10^-27    7           2.900 · 10^-8
3          4           5.217 · 10^-9     7           9.140 · 10^-8
4          4           4.312 · 10^-27    7           2.319 · 10^-7
5          4           5.554 · 10^-26    7           3.621 · 10^-8
6          4           1.333 · 10^-7     7           2.690 · 10^-7

Table 5.4: The optimal number of variables to use for each data set for Random Forests, i.e. the number of variables that yields the lowest error rate according to Cross Validation applied to Random Forests. The second lowest error and the corresponding number of variables are also shown.

variables that yield the lowest error. Using the result in Table 5.4, different numbers of variables are tested, and the variables used are those with the largest VI. The second variable selection method applied to each data set is to use the Adjusted Net Information Value (NIV) to rank the variables accordingly. Recall from Section 4.1.2 that the strength of a variable as a predictor depends on its NIV. Once the NIVs are calculated for each data set, different numbers of variables are tested to obtain the best model performance. The variables used are always those with the largest NIV.
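To illustrate the second method, the unadjusted NIV for one binned variable can be sketched as below, using the common NWOE/NIV definitions. The uplift package computes an adjusted variant via bootstrapping; this simplified version, on synthetic data, only shows the idea.

```python
# Unadjusted Net Information Value of a binned predictor x, given outcome y
# and treatment flag t (1 = treatment, 0 = control). NWOE per bin is
# log((P(x|y=1,T) P(x|y=0,C)) / (P(x|y=0,T) P(x|y=1,C))), and NIV sums the
# probability differences weighted by NWOE.
import numpy as np

def niv(x, y, t, eps=1e-6):
    x, y, t = map(np.asarray, (x, y, t))
    bins = np.unique(x)
    def dist(cond):                      # smoothed P(x = bin | cond)
        sel = x[cond]
        counts = np.array([(sel == b).sum() for b in bins], float) + eps
        return counts / counts.sum()
    p_y1_t = dist((t == 1) & (y == 1)); p_y0_t = dist((t == 1) & (y == 0))
    p_y1_c = dist((t == 0) & (y == 1)); p_y0_c = dist((t == 0) & (y == 0))
    nwoe = np.log((p_y1_t * p_y0_c) / (p_y0_t * p_y1_c))
    return float(np.sum((p_y1_t * p_y0_c - p_y0_t * p_y1_c) * nwoe))

rng = np.random.default_rng(3)
n = 20000
x = rng.integers(1, 6, n)                # a binned variable with 5 levels
t = rng.integers(0, 2, n)                # random treatment assignment
p = 0.10 + 0.03 * t * (x - 3)            # uplift depends on x: higher bins respond more
y = (rng.random(n) < p).astype(int)
noise = rng.integers(1, 6, n)            # an unrelated binned variable
print(niv(x, y, t) > niv(noise, y, t))   # the uplift-related variable ranks higher
```

Each NIV term has the form (a − b)·log(a/b), so the NIV is always non-negative; variables unrelated to the uplift get values near zero.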

Campaign   Qini using VI   Qini using NIV
1           0.0058          0.0077
2           0.0020          0.0030
3           0.0002          0.0015
4           0.0030          0.0035
5           0.0014          0.0022
6          -0.0010          0.0020

Table 5.5: The largest Qini value obtained for each data set when using VI and NIV, respectively, as variable selection method in Random Forests.

Once the variable selection is made, different values of the tuning parameters in the Random Forest have to be examined to obtain the optimal model performance. Good model performance is recognized as a Qini curve with an uplift that is greater when treating a subgroup of the population than when treating the entire population (Section 4.4.2). Also, a positive Qini value is preferred, as a negative one indicates that the result of an action is worse than doing nothing. The resulting Qini values for the optimal models on the different data sets can be seen in Table 5.5. In this project, the splitting criterion in Random Forests is based on the squared Euclidean distance, recall Section 4.2.2. The resulting Qini curves from modeling on each data set can be seen in Figure 5.1.
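The cumulative incremental gains behind such a Qini curve can be sketched as follows: observations are sorted by predicted uplift, split into segments, and the difference in success rates between treated and control observations is accumulated over the top segments. This is a simplified form of the thesis's equation (32), on synthetic data with oracle predictions.

```python
# Qini-style evaluation: cumulative incremental gains over the top-k segments,
#   g(k) = Y_t(k)/N_t(k) - Y_c(k)/N_c(k).
import numpy as np

def cumulative_incremental_gains(pred_uplift, y, t, n_segments=20):
    order = np.argsort(-pred_uplift)               # best predicted uplift first
    y, t = y[order], t[order]
    gains = []
    for k in range(1, n_segments + 1):
        top = slice(0, int(len(y) * k / n_segments))
        yt, tt = y[top], t[top]
        rate_t = yt[tt == 1].mean()                # success rate among treated
        rate_c = yt[tt == 0].mean()                # success rate among controls
        gains.append(rate_t - rate_c)
    return np.array(gains)

rng = np.random.default_rng(4)
n = 40000
uplift_true = rng.uniform(0, 0.2, n)               # heterogeneous true uplift
t = rng.integers(0, 2, n)
y = (rng.random(n) < 0.1 + uplift_true * t).astype(int)
gains = cumulative_incremental_gains(uplift_true, y, t)  # oracle predictions
print(round(float(gains[-1]), 3))                  # close to the average uplift
```

For a good model the curve starts steep (the top segments contain the customers with the largest uplift) and ends at the population-average uplift, matching the shape described for Figure 5.1.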

[Figure 5.1: six Qini-curve panels, Campaigns 1–6; x-axis: proportion of population targeted (%), y-axis: cumulative incremental gains (pc pt).]

Figure 5.1: Resulting Qini curves for each data set when using Random Forests. 20 segments have been used when evaluating the results. The curves resulting from using NIV as variable selection method are shown in blue, while the curves from using VI are shown in red. The black line represents a random classifier, and the black point on each curve marks the largest value of that curve.

5.2.2 Logistic Regression

Uplift modeling is implemented in two ways when using Logistic Regression: the first is Subtraction of Two Models and the second is the Class Variable Transformation. Subtraction of Two Models is implemented in Python, while the Class Variable Transformation, like Random Forests, is implemented in R using the uplift [13] package among others. The implementation process is described in the two following subsections, along with the corresponding results for each data set.

Subtraction of Two Models

Two different ways of performing variable selection are applied when using this approach, i.e. NIV and Lasso regularization. The reason for this is that Subtraction of Two Models proved difficult to implement in a way that gives applicable results, and that variable selection is a crucial step in the modeling process for uplift models. Thus, the first variable selection method applied to each data set is NIV. The NIV is calculated and

every variable is ranked accordingly. Next, each data set is split into training and test sets, with an equal amount of control and treatment data in the test data set so that the Qini curves can be plotted later on. Having a small number of control observations in the test data set results in some segments having zero control observations when calculating the incremental gains; the incremental gains then become undefined for those segments, recall equation (32) in Section 4.4.2. Hence, one common test data set containing an equal amount of control and treatment data is used. For the purpose of reducing the number of linearly dependent variables, Principal Component Analysis (PCA) is performed on both the training and test sets to reduce the dimensions. Next, two models are built, one for the treatment data set and one for the control data set. The models are built using Logistic Regression with Lasso regularization, which performs variable selection as it shrinks some coefficients towards zero (and sets some exactly to zero). The optimal value of the penalty term is obtained using Cross Validation. The number of variables left after variable selection (NIV) is 27 for campaign 1, of which 4 are binary and the rest are continuous. The continuous variables are reduced into Principal Components. The result is shown for 11 and 13 Principal Components; the final dimensions are then 15 and 17 for campaign 1. The results for campaign 1 are shown in Figures 5.2 and 5.3.
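The pipeline just described, i.e. PCA on the features, then L1-regularised (Lasso) logistic regression per group, then subtraction of the predicted probabilities, can be sketched with scikit-learn. The data, sizes and component counts below are synthetic, not the campaign data.

```python
# Subtraction of Two Models: PCA + Lasso logistic regression per group.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)
n, p = 5000, 20
X = rng.normal(size=(n, p))
X[:, :2] *= 3.0                                   # give the signal columns more variance
t = rng.integers(0, 2, n)                         # treatment indicator
logit = -1.0 + X[:, 0] + 0.5 * t * X[:, 1]        # treatment effect acts through X[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

def fit_group(Xg, yg, n_components=10):
    # Dimension reduction, then Lasso-penalised logistic regression with the
    # penalty strength chosen by cross validation.
    model = make_pipeline(
        PCA(n_components=n_components),
        LogisticRegressionCV(penalty='l1', solver='saga', Cs=5, max_iter=5000),
    )
    return model.fit(Xg, yg)

m_treat = fit_group(X[t == 1], y[t == 1])         # model on the treatment data
m_ctrl = fit_group(X[t == 0], y[t == 0])          # model on the control data
uplift = m_treat.predict_proba(X)[:, 1] - m_ctrl.predict_proba(X)[:, 1]
print(round(float(uplift.mean()), 3))
```

As discussed later in the thesis, each model can fit its own group well while the subtraction still gives a noisy uplift estimate, since the two models are never trained in relation to each other.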

Figure 5.2: ROC curve (left) and Qini curve (right) for campaign 1 using 11 Principal Components. The red curve is the model for the control group data and the blue curve is the model for the treatment group data. The AUC is 0.7518 and 0.6945 for the red and the blue curve, respectively.

The number of variables left after variable selection for campaign 2 is 26, of which 3 are binary and the rest are continuous. The result is shown for 6 and 10 Principal Components; the final dimensions are then 9 and 13 for campaign 2. The results for campaign 2 are shown in Figures 5.4 and 5.5. Just like campaigns 1 and 2, the rest of the campaigns yield poor uplifts; hence, only the results for campaigns 1 and 2 are presented here.

Figure 5.3: ROC curve (left) and Qini curve (right) for campaign 1 using 13 Principal Components. The red curve is the model for the control group data and the blue curve is the model for the treatment group data. The AUC is 0.7101 and 0.6792 for the red and the blue curve, respectively.

Figure 5.4: ROC curve (left) and Qini curve (right) for campaign 2 using 6 Principal Components. The red curve is the model for the control group data and the blue curve is the model for the treatment group data. The AUC is 0.9143 and 0.7592 for the red and the blue curve, respectively.

Figure 5.5: ROC curve (left) and Qini curve (right) for campaign 2 using 10 Principal Components. The red curve is the model for the control group data and the blue curve is the model for the treatment group data. The AUC is 0.7592 and 0.7273 for the red and the blue curve, respectively.

Class Variable Transformation

Before any modeling is done, the variable selection method chosen for the Class Variable Transformation is the NIV. After choosing the best variables according to the NIV, the data sets are split into training and test sets, with the same amount of data from the control group and the treatment group in each test data set. Next, the Class Variable Transformation is applied to both the training and test data sets according to (10). After the transformation is completed, resampling is used to increase the share of control observations in the training data sets. Each training data set is then fit with Logistic Regression using Z as the response variable. Different numbers of predictor variables are evaluated in order to obtain the best performing model.

Campaign   Qini for Z   AUC for Z
1          0.0643       0.8619
2          0.0611       0.8642
3          0.0448       0.8872
4          0.0221       0.6054
5          0.0232       0.6520
6          0.0669       0.8292

Table 5.6: The largest Qini value and corresponding AUC obtained for each data set when using Logistic Regression and the Class Variable Transformation.
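The transformation and modeling step can be sketched on synthetic data as follows, assuming for simplicity a 50/50 treatment split so that the uplift can be recovered as 2P(Z = 1|x) − 1 (the thesis applies (10) to unbalanced data and resamples instead).

```python
# Class Variable Transformation: Z = 1 when a treated customer buys or a
# control customer does not, i.e. Z = y*t + (1-y)*(1-t); with P(t=1) = 1/2,
# uplift(x) = 2*P(Z=1|x) - 1.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 30000
x = rng.normal(size=(n, 3))
t = rng.integers(0, 2, n)                          # 50/50 treatment assignment
p = 0.2 + 0.1 * t * (x[:, 0] > 0)                  # true uplift of 0.1 when x0 > 0
y = (rng.random(n) < p).astype(int)

z = y * t + (1 - y) * (1 - t)                      # the transformed class variable
clf = LogisticRegression().fit(x, z)               # single model on Z
uplift_hat = 2 * clf.predict_proba(x)[:, 1] - 1

hi = uplift_hat[x[:, 0] > 0].mean()                # higher where the true uplift is positive
lo = uplift_hat[x[:, 0] <= 0].mean()               # lower where it is not
print(hi > lo)
```

The point of the transformation is that a single classifier on Z replaces the two separate models, so the model is trained directly on a quantity tied to the uplift.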

[Figure 5.6: six Qini-curve panels, Campaigns 1–6; x-axis: proportion of population targeted (%), y-axis: cumulative incremental gains (pc pt).]

Figure 5.6: Resulting Qini curves for each data set when using Logistic Regression and the Class Variable Transformation. 20 segments have been used when evaluating the results. The black line represents a random classifier and the black point on each curve marks the largest value of that curve.

Resulting Qini and AUC values can be seen in Table 5.6 and the resulting Qini curves and ROC curves for the Class Variable Transformation can be seen in Figure 5.6 and Figure 5.7, respectively.

[Figure 5.7: ROC curves for the Class Variable Transformation; x-axis: specificity, y-axis: sensitivity; one curve per campaign plus a random-classifier reference line.]

Figure 5.7: Resulting ROC curves for each data set when using Logistic Regression and the Class Variable Transformation.

5.2.3 Neural Networks

The results from using the uplift modeling approaches Subtraction of Two Models and the Class Variable Transformation along with a Multilayer Perceptron (MLP) are presented in the following subsections.

Subtraction of Two Models

Using this approach, two models are trained to fit the control data and the treatment data separately using an MLP network. As for Logistic Regression, the resulting models are tested on the same test data set, which contains an equal amount of control and treatment data. The performance of the two models is then visualized with two ROC curves. The resulting estimate of the uplift is calculated according to Subtraction of Two Models in Section 4.2.2 and visualized with the Qini curve. The method gives similar results for the 6 campaigns: it finds good models for the control and treatment groups separately, but fails to find any uplift. Hence, only results for campaigns 1 and 2 are presented, with different numbers of Principal Components. For campaign 1 the number of variables is 27 after the variable selection method NIV has been applied; 4 are binary and the rest are continuous. The result is shown for 8 and 13 components, but the models yield similar results for other numbers of components as well. The final dimensions are then 12 and 17 for campaign 1, see Figures 5.8 and 5.9. Campaign 2 has 26 variables left after the variable selection step; 3 are binary and the rest are continuous. The result is again shown for 8 and 13

Figure 5.8: ROC curve (left) and Qini curve (right) for campaign 1 using 8 Principal Components. The red curve is the model built on the control group data and the blue curve is the model built on the treatment group data. The AUC is 0.7187 and 0.7176 for the red and the blue curve, respectively.

Figure 5.9: ROC curve (left) and Qini curve (right) for campaign 1 using 13 Principal Components. The red curve is the model built on the control group data and the blue curve is the model built on the treatment group data. The AUC is 0.7388 and 0.7758 for the red and the blue curve, respectively.

Principal Components. The final dimensions for campaign 2 are then 11 and 16, respectively, see Figures 5.10 and 5.11.

Figure 5.10: ROC curve (left) and Qini curve (right) for campaign 2 using 8 Principal Components. The red curve is the model built on the control group data and the blue curve is the model built on the treatment group data. The AUC is 0.7826 and 0.6825 for the red and the blue curve, respectively.

It is important to clarify that the resulting treatment and control training data sets are not the same size, since the treatment data is much larger than the control data, recall the shares of treatment and control data in Table 3.1, Section 3. The model trained on the control data consequently has a greater risk of overfitting because of the small data

Figure 5.11: ROC curve (left) and Qini curve (right) for campaign 2 using 13 Principal Components. The red curve is the model built on the control group data and the blue curve is the model built on the treatment group data. The AUC is 0.8066 and 0.7081 for the red and the blue curve, respectively.

size, and also because Neural Networks is a more complex method than the other methods used in this project. This is adjusted for with different values of the regularization parameter, i.e. Ridge regularization. The models are also regularized by tuning the number of layers and nodes for each data set.
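The setup just described, an MLP with an L2 (Ridge-type) weight penalty and the architecture treated as a further regularization knob, can be sketched with scikit-learn's MLPClassifier. The data, architecture and parameter values below are illustrative only.

```python
# MLP with L2 weight penalty (alpha) on synthetic binary-outcome data.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(3000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=3000) > 0.5).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A larger alpha (stronger regularization) would be the natural choice for
# the smaller control data set, to mitigate the overfitting risk noted above.
clf = MLPClassifier(hidden_layer_sizes=(16, 8), alpha=1e-2,
                    max_iter=500, random_state=0).fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 2))             # held-out accuracy
```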

Class Variable Transformation

Resampling is used in this method as well, but through a built-in method in the R uplift package [3]. This method resamples the data in a manner that is well suited for uplift modeling when using the Class Variable Transformation. Overall, it is difficult to resample the data in a way such that the class distribution is balanced while at the same time keeping an acceptable amount of control and treatment data; hence this built-in method is used. Figure 5.12 shows the resulting Qini curves for each campaign, and the respective ROC curves are presented in Figure 5.13.
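The built-in resampling of the uplift package is R-specific, but the basic idea of random over-sampling of the minority (control) class can be illustrated in Python as follows, on a synthetic treatment/control split:

```python
# Random over-sampling of the minority class to a 50/50 split.
import numpy as np

rng = np.random.default_rng(8)
t = (rng.random(10000) < 0.9).astype(int)           # ~90% treatment, ~10% control
minority = np.where(t == 0)[0]                      # control indices
majority = np.where(t == 1)[0]                      # treatment indices
resampled = rng.choice(minority, size=len(majority), replace=True)
balanced_idx = np.concatenate([majority, resampled])
print((t[balanced_idx] == 0).mean())                # control share is now 0.5
```

Sampling with replacement repeats minority observations, which is exactly the overfitting risk the discussion in Section 6.1 returns to.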

Campaign   Qini     AUC      Components (PCA)   Binary Var.
1          0.0202   0.5503    8                 6
2          0.0251   0.5555   12                 4
3          0.0268   0.7934    9                 5
4          0.0264   0.8024    8                 5
5          0.0039   0.5791    8                 4
6          0.0133   0.7052    8                 5

Table 5.7: Summary values for each campaign: the Qini value, the AUC, the number of Principal Components and the number of binary variables.

Figure 5.12: The Qini curves for campaigns 1, 2 and 3 are on the first row and for campaigns 4, 5 and 6 on the second row. The results are from using the Class Variable Transformation along with Neural Networks.

Figure 5.13: The ROC curves for Campaign 1, 2 and 3 are on the first row and for campaign 4, 5 and 6 on the second row. The results are from using the Class Variable Transformation along with Neural Networks.

5.2.4 Cutoff for Classification of Customers

The best performing models are used to obtain the cutoff for classifying customers, i.e. for deciding which customers to send the campaigns to. For example, if the cutoff is 0.55, a customer is classified as a persuadable if the predicted probability of being a persuadable is ≥ 0.55. The uplift is calculated with the respective cutoff for Random Forests (Table 5.8), the Class Variable Transformation using Logistic Regression (Table 5.9), and Neural Networks (Table 5.10). Only the models with good uplift results are included in this section. The corresponding percentage of the total population (in the test data set) to target is included in the tables as well.
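Applying a cutoff is then a simple thresholding step; a sketch on made-up model scores:

```python
# Classify customers as persuadables when their predicted score >= cutoff.
import numpy as np

rng = np.random.default_rng(9)
predicted = rng.random(1000)                        # toy model scores in [0, 1]
cutoff = 0.55                                       # illustrative cutoff value
target = predicted >= cutoff                        # customers to send the campaign to
print(round(float(target.mean()), 2))               # share of the population targeted
```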

Campaign   Cutoff   Incremental Gains   Percentage of Treated
1          0.1755   0.0277              50%
2          0.1938   0.0157              55%
3          0.1287   0.0164              60%
4          0.2399   0.0282              80%
5          0.1975   0.0137              70%
6          0.3356   0.0264              75%

Table 5.8: The chosen cutoffs for Random Forests.

Campaign   Cutoff   Incremental Gains   Percentage of Treated
1          0.0730   0.1821              40%
2          0.0721   0.0771              40%
3          0.0545   0.1247              35%
4          0.1080   0.2383              50%

Table 5.9: The chosen cutoffs for the Class Variable Transformation using Logistic Regression.

Campaign   Cutoff   Incremental Gains   Percentage of Treated
3          0.0387   0.1284              75%

Table 5.10: The chosen cutoff for the Class Variable Transformation using Neural Networks.

The chosen cutoffs are not necessarily the ones that have the largest incremental gains. In some cases, the largest incremental gain is found when the whole population is targeted, recall the black points in Figures 5.1 and 5.6. Thus, cutoffs with relatively good incremental gains are chosen in order to avoid targeting the whole population.

6 Conclusions

The overall conclusion is that, for all data sets, each model required many attempts in order to capture the model parameters that yielded the best model performance. Generally speaking, the models performed poorly, although some models were able to obtain satisfying results. In the following sections, the results are discussed and suggestions for future studies are presented.

6.1 Discussion

When building the uplift models using Random Forests, it was not always possible to obtain models that performed better than a random classifier, which means that in those cases the result of an action is worse than doing nothing. This was the case for some data sets when using VI as the variable selection method. Overall, NIV performed better as a variable selection method for Random Forests than VI combined with Cross Validation did (Figure 5.1). For some data sets it was easy to capture well-performing models using VI, but for most it was not. The NIV, on the other hand, was able to capture well-performing models for all data sets. Furthermore, Table 5.5 shows that the Qini value is greater for every data set when using NIV as the variable selection method. The issue that arises when using Random Forests with the Gini index as splitting criterion, i.e. using VI as in this project, is that the algorithm tends to favor categorical predictors with many categories, which in turn can lead to overfitting. Therefore, such predictors should be avoided according to [5]. Since a great number of categorical predictors are used as input to the Random Forests in this project, this is possibly the reason why selecting variables according to VI did not yield as good model performance as selecting them according to NIV. Hence, NIV is better suited as a variable selection method than VI with the Gini index as splitting criterion for Random Forests, when the purpose is to apply it in an uplift modeling setting. Random Forests is the only statistical machine learning method used in this project that was able to capture applicable models without having to resample the training data sets to increase the amount of control data.
One reason for this is one of the parameters of the upliftRF() method in the uplift [3] package in R, which defines the minimum number of control observations that must exist in any terminal node. Hence, during the building process, the tree is forced to contain control data in every region, and the Qini curves could be obtained without any obstacles. Looking at the results from Subtraction of Two Models, both Logistic Regression and Neural Networks were able to capture good models for the treatment data and the control data separately. This can be seen in the ROC curves, Figures 5.2 and 5.8, where the curves lie well above the diagonal line and the AUC values are well above 0.5. This means that the models perform well on unseen data (test data). However, this does not always correspond to a good uplift, see the Qini curves in Figures 5.2 and 5.8. The same results were obtained in the article [17]. When training the models separately, it is not certain that the models predict a large difference in probability; if it were possible to train the models in relation to each other, the result

could give a better uplift. The result can be seen in the respective Qini curves, where the uplift for Logistic Regression and Neural Networks becomes negative. This means that the models predict a negative gain of the treatment, i.e. that the campaign always makes customers buy less, which is highly unlikely. Logistic Regression has satisfying results for the Class Variable Transformation. This conclusion can be made since the Qini curves are above the diagonal line and the incremental gains are positive. Also, the AUC ≥ 0.5, which means the models perform better than a random classifier. The results for Neural Networks are also acceptable, looking at both the Qini curves and the ROC curves, although Logistic Regression gives a larger uplift for a smaller amount of targeted customers for campaigns 1 to 4. Also, the ROC curves show that Logistic Regression performs better. The main problem for Neural Networks is that they need balanced classes to perform well. When doing the Class Variable Transformation described in Section 4.2.3, it is not easy to obtain training data with balanced classes. This is because the share of the control group was only 9.66% in most of the campaigns, and by applying the transformation the amount of control data becomes even smaller. The test data set needs an equal amount of control and treatment data, which makes the minority class even smaller in the training data set. Thus, over-sampling was used to overcome the issue of a small minority class, but such a solution raises another issue: re-sampling can result in overfitting. The resulting Qini curves for the Class Variable Transformation show that the maximum incremental gain is obtained by treating all the customers. This means that there are no negative effects of the treatments, i.e. there are no customers who refrain from making a purchase just because they received the campaign.
Moreover, by looking at campaigns 1, 2, 3 and 4 for Logistic Regression, Figure 5.6, it can be seen that the incremental gains are near their maximum when approximately 50% of the population is targeted. This means that only 50% of the customers in the test data need to receive the campaign in order for the company to obtain almost the same profit as if the entire population were targeted. Put differently, the campaigns do not need to be sent to the remaining 50% of the customers, as their purchase behaviour will be similar whether or not they receive a campaign offer. The final results are the cutoffs obtained from the best performing models for Random Forests, as well as for Logistic Regression and Neural Networks along with the Class Variable Transformation. The cutoff is decided based on the Qini curves. In Figures 5.1 and 5.6, the maximum value of the incremental gains for each model is marked with a black point in each graph. For example, by looking at campaign 2 in Figure 5.6, one can see that the incremental gain is almost the same at 40% as at 100%; hence, the cutoff can be chosen at that point so that the campaigns will only be sent out to a subgroup of the entire population. Put differently, the cutoff is used when a new campaign is to be sent out and a decision of which customers to target has to be made. The test data of the customers is simply sent to the trained model, whose output is the probability of which class they belong to. The cutoff is then used to make the final classifications, i.e. the decision of who will get the campaign.

6.2 Future Work

The market evaluated in this project was one country. It would have been interesting to evaluate whether different markets differ from each other. It might be that customers in different markets react differently to marketing campaigns, and even differently to different kinds of campaigns. Thus, possible future work could be to look into whether the marketing campaign/offer should be of a different kind depending on which market is targeted. This could lead to happier and more loyal customers, as well as an uplift for the company in terms of greater sales. Furthermore, the segment of customers evaluated in this project is the frequent kind, i.e. the company's most loyal customers. Depending on the stage of the customer, the offers might vary. One question to consider is whether the best offer should be given to the most loyal customers to keep their interest, or to a new customer, or even to the least loyal customer. A new customer might need a reason to gain trust in the company, while the least loyal customer needs a reason to start interacting more frequently with the company. Hence, investigating the other stages of customers in different segments of the customer base might lead to insights into how to interact with different types of customers. As mentioned throughout this thesis, the sizes of the treatment and control groups are crucial for being able to perform uplift modeling in a satisfying way and to get an appealing result. For uplift modeling to actually be beneficial when sending out campaign offers in the future, the share of the control group should be larger than the 9.66% that was the share in most of the campaigns in this project. This is especially important for being able to perform the Class Variable Transformation, as some of the observations in the control group get excluded in the final data set.
Using a larger control group in future campaigns could hence yield a better predictive model when performing uplift modeling. Once the persuadables are identified using uplift modeling, the company has an indicator of which customers to target with campaign offers. The only known attribute shared by these customers is that they can be considered to belong to the group of persuadables, but nothing more than that. Investigating this segment of customers further would be interesting and could yield deeper insights about the customer base. For this purpose, unsupervised learning approaches such as clustering methods could be useful. Using a clustering method to investigate this group of customers could give insights into what other attributes these individuals have in common. Knowing which attributes are similar for these individuals would provide a tool for personalizing the campaign offers. This could lead to even more loyal customers and thus a gain for the company. As noted earlier in this project, the variable selection is a crucial component of the data pre-processing part and an area that can be improved when it comes to Random Forests. When using VI and Cross Validation for classification, the default quality measure for node impurity in the randomForest package [10] in R is the Gini index. Radcliffe and Surry [15] propose that the quality measure should instead be based on a pessimistic qini estimate, which is supposed to reduce the likelihood of choosing variables that lead to unstable models. Hence, this could be an improvement to apply in future studies. Another investigation to add to the modeling with Random Forests is to test other splitting criteria than the squared Euclidean distance. Rzepakowski and Jaroszewicz [16] propose that the splitting criterion can be based on the Kullback-Leibler divergence or the chi-squared divergence as well. This might not lead to better performance, but it is worth considering.
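To make the alternative splitting criterion concrete, a split in an uplift tree can be scored by how much it increases the divergence between treatment and control outcome distributions in the child nodes. The sketch below loosely follows the KL-based idea of Rzepakowski and Jaroszewicz [16]; their full criterion adds normalization terms omitted here, and the function names are our own:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """KL divergence between two Bernoulli distributions with success
    probabilities p (treatment) and q (control)."""
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_gain(y_t, y_c, left_t, left_c):
    """Gain of a candidate split: weighted treatment/control divergence
    in the child nodes minus the divergence in the parent node.
    y_t, y_c: binary outcomes; left_t, left_c: boolean masks sending
    each observation to the left child."""
    n_total = len(y_t) + len(y_c)
    gain = 0.0
    for side_t, side_c in ((left_t, left_c), (~left_t, ~left_c)):
        n_t, n_c = side_t.sum(), side_c.sum()
        if n_t == 0 or n_c == 0:
            continue  # skip empty sides in this simplified sketch
        weight = (n_t + n_c) / n_total
        gain += weight * kl_divergence(y_t[side_t].mean(), y_c[side_c].mean())
    return gain - kl_divergence(y_t.mean(), y_c.mean())
```

A split that leaves the treatment and control response rates unchanged in both children gets zero gain, while one that concentrates the treatment/control difference into a child node is rewarded; the chi-squared variant simply replaces `kl_divergence` with a chi-squared distance.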

6.3 Final Words

The overall conclusion is that, given the data related to the different campaigns in this project, it is possible to perform uplift modeling and obtain models that make it possible to target only a subgroup of the entire customer base with campaign offers instead of targeting the whole customer base. Doing this, the retail company still receives an incremental gain. For the uplift to be successful, the method of choice should be either the Modeling Uplift Directly approach using Random Forests, or the Class Variable Transformation using Logistic Regression. This is because Neural Networks are sensitive to uneven class distributions and are thus not able to produce stable models given the data in this project. Moreover, the Subtraction of Two Models was shown not to yield applicable results, as the two separate models of the treatment data and the control data did not combine into a satisfying model. The variable selection proved to be a crucial part of the modeling process, and thus a lot of focus should be put on this step when building the models. Overall, using the NIV as variable selection method yielded good performance. Another crucial component in this project was the amount of treatment and control data in each campaign. Having a larger amount of control data in future studies would yield even better-performing and more stable models, so that resampling can be avoided and the risk of overfitting the models decreased. In total, given the data sets used in this project and the market of choice, the uplift approach works given the right circumstances, and it can thus yield a gain for the retail company to start using it.

References

[1] Casella, G., Fienberg, S., and Olkin, I. Modern Multivariate Statistical Techniques. Artificial Neural Networks. Springer, 2008. DOI: 10.1007/978-0-387-78189-1.

[2] Fawcett, Tom. “An Introduction to ROC analysis”. In: Pattern Recognition Letters 27.2 (June 2006), pp. 861–874. URL: https://www.sciencedirect.com/science/article/abs/pii/S016786550500303X.

[3] Guelman, Leo. “uplift: Uplift Modeling”. R package version 0.3.5. 2014. URL: https://CRAN.R-project.org/package=uplift.

[4] Guelman, Leo, Guillén, Montserrat, and Pérez-Marín, Ana M. “Random Forests for Uplift Modeling: An Insurance Customer Retention Case”. Ed. by Kurt J. Engemann, Anna M. Gil-Lafuente, and José M. Merigó. 2012, pp. 123–133.

[5] Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer Series in Statistics. Springer, 2008.

[6] Izenman, Alan J. Modern Multivariate Statistical Techniques: Regression, Classification and Manifold Learning. Springer Series in Statistics. Springer, 2008.

[7] Jaśkowski, Maciej and Jaroszewicz, Szymon. “Uplift modeling for clinical trial data”. In: ICML Workshop on Machine Learning for Clinical Data Analysis (2012).

[8] Tang, Jiliang, Alelyani, Salem, and Liu, Huan. “Feature Selection for Classification: A Review”. Arizona State University, Jan. 2014.

[9] Kozlowska, Iga. “Facebook and Data Privacy in the Age of Cambridge Analytica”. The Henry M. Jackson School of International Studies, University of Washington, 2018. URL: https://jsis.washington.edu/news/facebook-data-privacy-age-cambridge-analytica/.

[10] Liaw, Andy and Wiener, Matthew. “Classification and Regression by randomForest”. In: R News 2.3 (2002), pp. 18–22. URL: https://CRAN.R-project.org/doc/Rnews/.

[11] Montgomery, Douglas C., Peck, Elizabeth A., and Vining, G. Geoffrey. Introduction to Linear Regression Analysis. 5th ed. Wiley Series in Probability and Statistics. Wiley, 2012.

[12] Pedregosa, F. et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[13] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2019. URL: https://www.R-project.org/.

[14] Radcliffe, Nicholas J. “Using control groups to target on predicted lift: Building and assessing uplift model”. 2007, pp. 4–7. URL: https://www.semanticscholar.org/paper/Using-control-groups-to-target-on-predicted-lift%3A-Radcliffe/147b32f3d56566c8654a9999c5477dded233328e.

[15] Radcliffe, Nicholas J. and Surry, Patrick D. “Real-World Uplift Modelling with Significance-Based Uplift Trees”. 2012.

[16] Rzepakowski, Piotr and Jaroszewicz, Szymon. “Decision trees for uplift modeling with single and multiple treatments”. In: Knowledge and Information Systems 32.2 (Aug. 2012), pp. 303–327. ISSN: 0219-3116. DOI: 10.1007/s10115-011-0434-0.

[17] Rzepakowski, Piotr and Jaroszewicz, Szymon. “Uplift modeling in direct marketing”. In: Journal of Telecommunications and Information Technology 2012 (Jan. 2012), pp. 43–50.

[18] SAS Institute Inc. SAS/STAT 13.1 User’s Guide. 2013.

[19] Sołtys, Michał, Jaroszewicz, Szymon, and Rzepakowski, Piotr. “Ensemble methods for uplift modeling”. In: Data Mining and Knowledge Discovery 29.6 (Nov. 2015), pp. 1531–1559. ISSN: 1573-756X. DOI: 10.1007/s10618-014-0383-9.

[20] Stedman, Craig. “How uplift modeling helped Obama’s campaign — and can aid marketers”. In: Predictive Analytics Times (2013). URL: https://www.predictiveanalyticsworld.com/patimes/how-uplift-modeling-helped-obamas-campaign-and-can-aid-marketers/2613/.

[21] Hastie, Trevor, Tibshirani, Robert, and Wainwright, Martin. Statistical Learning with Sparsity: The Lasso and Generalizations. 1st ed. CRC Press, 2015. ISBN: 9781498712163.
